The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/). Part of the dataset are files reflecting annotators choices and agreement for assignment of verbs to classes.
The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/).
CzEngVallex is a bilingual valency lexicon of corresponding Czech and English verbs. It connects 20835 aligned valency frame pairs (verb senses) which are translations of each other, aligning their arguments as well. The CzEngVallex serves as a powerful, real-text-based database of frame-to-frame and subsequently argument-to-argument pairs and can be used for example for machine translation applications. It uses the data from the Prague Czech-English Dependency Treebank project (PCEDT 2.0, http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4) and it also takes advantage of two existing valency lexicons: PDT-Vallex for Czech and EngVallex for English, using the same view of valency (based on the Functional Generative Description theory). The CzEngVallex is available in an XML format in the LINDAT/CLARIN repository, and also in a searchable form (see the “More Apps” tab) interlinked with PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F),EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2) and with examples from the PCEDT.
CzeSL-GEC is a corpus containing sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech and Czech pupils with Romani background. To create this corpus, unreleased CzeSL-man corpus (http://utkl.ff.cuni.cz/learncorp/) was utilized. All sentences in the corpus are word tokenized.
We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
Danish Fungi 2020 (DF20) is a fine-grained dataset and benchmark. The dataset, constructed from observations submitted to the Danish Fungal Atlas, is unique in its taxonomy-accurate class labels, small number of errors, highly unbalanced long-tailed class distribution, rich observation metadata, and well-defined class hierarchy. DF20 has zero overlap with ImageNet, allowing unbiased comparison of models fine-tuned from publicly available ImageNet checkpoints.
The dataset has 1,604 different classes, with 248,466 training images and 27,608 test images.
The corpus contains Czech speech of laryngectomy patients recorded before a surgery causing their voice to be lost in order to preserve the voice which can be later used for personalized text-to-speech system. Individual utterances were selected from the language by a special algorithm to cover as much phonetic and prosodic features as possible.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 38A from 1943 contains footage from the Days of Czech Youth event organised by the Board of Trustees for the Education of Youth from 11 to 12 September. A concert of three brass bands, led by Miloš Kuba, and the Kühn Children´s Choir was held on Peace Square at 5 pm on 11 September. A procession of the Board´s members set out from Peace Square and continued through the streets of Prague. The event culminated with a track and field championship at Strahov Stadium where the winners of district rounds competed against each other. The spectators were welcomed by General Secretary of the Board František Teuner. The programme included a dance performance by girls in folk costumes. The event concluded with a speech by Minister of Education and People´s Enlightenment and Chairman of the Board Emanuel Moravec, followed by a solemn oath "to the Führer and to the Fatherland".
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 35A from 1943 captures the mood of the District Youth Track and Field Championship for Ages 10-18, which was organised by the Board of Trustees for the Education of Youth in eighty towns of the Protectorate as part of the Days of Czech Youth event held from 28 to 29 August 1943. At the A. F. K. Stadium in Kolín nad Labem, approximately 1,500 athletes qualified for the Track and Field Championship of Bohemia and Moravia.
This corpus contains the text of De Latinae Linguae Reparatione authored by Marcus Antonius Sabellicus (1436–1506), annotated with respect to lemmas, part-of-speech tags, morphological features and syntactic dependencies according to the typological formalism of Universal Dependencies (UD).
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 17B from 1945 shows a competition for the best decorated Easter egg, which was organised by girls from the Moravian Slovak branch of the Board of Trustees for the Education of Youth as part of the youth service of honour. Local women artisans, skilled in the traditional techniques, helped them with painting and etching patterns on Easter eggs.
The Sequoia corpus is a set of 3,099 linguistically-annotated French sentences, originating from four sources (Europarl, European Agency Reports, French regional journal L'Est Républicain, and French wikipedia).
Several types of annotations were added over the years.
The current release comprises:
- parts-of-speech (SEQUOIA ANR-08-EMER-013 project)
- syntactic dependency trees
- deep syntactic dependency graphs (Deep sequoia project)
- multi-word expressions and named entities (PARSEME COST project and PARSEME-FR ANR-14-CERA-0001 project)
- coarse semantic tags for nouns (FrSemCor project)
See the deep sequoia page for a detailed description: https://deep-sequoia.inria.fr/
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3226). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3424). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3687). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
The dataset contains delimitation of borders of dialect regions, subgroups, areas and types in the Czech Republic. It is the result of an extensive expert revision that was based on various sources and made the delimitation exact and accurate. At the same time, the dataset corresponds to the underlying data of the Mapka application running at https://korpus.cz/mapka/
There are four files in this submission. Two files contain the delimitation of dialect regions ("oblasti"; both in GeoJSON and Shapefile formats) and two files contain the delimitation of smaller dialect areas, i.e. subgroups, areas and types ("oblasti_jemne"; again in GeoJSON and Shapefile formats).
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Changes in version 1.1:
1. Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset.
2. SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0.
3. Balto-Slavic languages, Germanic languages and Romance languages were tagged by classifier trained only on the respective group of languages. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 43A captures the demobilisation of the Czechoslovak Army after 9 October 1938. Conscripted soldiers are handing in their equipment and leaving the barracks.
DeriNet is a lexical network which contains derivational relations in Czech modeled as an oriented graph. Nodes correspond to Czech lexemes (a lexeme is a single lemma, possibly with only a subset of its senses – homonyms may have different derivations and are thus represented by several lexemes) and edges represent derivations between them. DeriNet 1.0 contains 968,967 lexemes with 965,535 unique lemmas; connected by 715,729 derivational links. Lexemes in DeriNet 1.0 are sampled from the MorfFlex dictionary.
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes (i.e. single lemmas, possibly with only a subset of their senses), edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.2, contains 1,003,590 lexemes (sampled from the MorfFlex dictionary) with 1,001,394 unique lemmas, connected by 740,750 derivational links. Both rather technical and linguistic changes were made as compared to the previous version of the data; e.g. new version of the MorfFlex dictionary was used, derived words that contain a consonant and/or vowel alternation (e.g. boží) were connected with their base word (bůh).
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.5, contains 1,011,965 lexemes (sampled from the MorfFlex dictionary) connected by 785,543 derivational links. Besides several rather conservative updates (such as newly identified prefix and suffix verb-to-verb derivations as well as noun-to-adjective derivations manifested by most frequent adjectival suffixes), DeriNet 1.5 is the first version that contains annotations related to compounding (compound words are distinguished by a special mark in their part-of-speech labels).
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.6, contains 1,027,832 lexemes (sampled from the MorfFlex dictionary) connected by 803,404 derivational links. Furthermore, starting with version 1.5, DeriNet contains annotations related to compounding (compound words are distinguished by a special mark in their part-of-speech labels).
Compared to version 1.5, version 1.6 was expanded by extracting potential links from dictionaries available under suitable licences, such as Wiktionary, and by enlarging the number of marked compounds.
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational or compositional relations between a derived word and its base word / words. The present version, DeriNet 2.0, contains 1,027,665 lexemes (sampled from the MorfFlex dictionary) connected by 808682 derivational and 600 compositional links.
Compared to previous versions, version 2.0 uses a new format and contains new types of annotations: compounding, annotation of several morphological and other categories of lexemes, identification of root morphs of 244,198 lexemes, semantic labelling of 151,005 relations using five labels and identification of 13 fictitious lexemes.
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent word-formational relations between a derived word and its base word / words. The present version, DeriNet 2.1, contains 1,039,012 lexemes (sampled from the MorfFlex CZ 2.0 dictionary) connected by 782,814 derivational, 50,533 orthographic variant, 1,952 compounding, 295 univerbation and 144 conversion relations.
Compared to the previous version, version 2.1 contains annotations of orthographic variants, full automatically generated annotation of affix morpheme boundaries (in addition to the roots annotated in 2.0), 202 affixoid lexemes serving as bases for compounding, annotation of corpus frequency of lexemes, annotation of verbal conjugation classes and a pilot annotation of univerbation. The set of part-of-speech tags was converted to Universal POS from the Universal Dependencies project.
Diachronic corpus of Czech sized 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th-20th century period. The texts are transcribed, not transliterated. Diakorp v6 is provided in a CoNLL-U-like vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to the registered users of CNC at http://www.korpus.cz
Phonological neighborhood density is known to influence lexical access, speech production as well as perception processes. Lexical competition is thought to be the central concept from which the neighborhood effect emanates: highly competitive neighborhoods are characterized by large degrees of phonemic co-activation, which can delay speech recognition and facilitate speech production. The present study investigates phonetic learning in English as a foreign language in relation to phonological neighborhood density and onset density to see whether dense or sparse neighborhoods are more conducive to the incorporation of novel phonetic detail. In addition, the effect of voice-contrasted minimal pairs (bat-pat) is explored. Results indicate that sparser neighborhoods with weaker lexical competition provide the most optimal phonological environment for phonetic learning. Moreover, novel phonetic details are incorporated faster in neighborhoods without minimal pairs. Results indicate that lexical competition plays a role in the dissemination of phonetic updates in the lexicon of foreign language learners.
Titles of courses possibly relevant to the Digital Humanities for 2017-2018, manually gathered from course catalogues of most Czech state colleges, including the names of the teachers, department and school names, and the school-unique course IDs. All this information was publicly available in the individual course catalogues accessed from the official websites of the individual colleges.
The aim of the course is to introduce digital humanities and to describe various aspects of digital content processing.
The course consists of 10 lessons with video material and a PowerPoint presentation with the same content.
Every lesson contains a practical session – either a Jupyter Notebook to work in Python or a text file with a short description of the task. Most of the practical tasks consist of running the programme and analyse the results.
Although the course does not focus on programming, the code can be reused easily in individual projects.
Some experience in running Python code is desirable but not required.
The data set includes training, development and test data from the shared tasks on pronoun-focused machine translation and cross-lingual pronoun prediction from the EMNLP 2015 workshop on Discourse in Machine Translation (DiscoMT2015). The release also contains the submissions to the pronoun-focused machine translation along with the manual annotations used for the official evaluation as well as gold-standard annotations of pronoun coreference for the shared task test set.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 48B from 1943 is about an event of the Board of Trustees for the Education of Youth called Sewing Dolls, which was part of the mandatory service. Girls, supervised by instructors, made toys out of pieces of cloth for the children of the labourers working in the Reich.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 52A, B from 1944 shows the distribution of Christmas presents to poor children, which was organised by Social Aid in collaboration with the Board of Trustees for the Education of Youth and took place in the Great Hall of Lucerna Palace on 18 December. The event was attended by Minister of Education and People´s Enlightenment and Chairman of the Board Emanuel Moravec.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 24B from 1944 was shot during the Days of Czech Youth organised by the Board of Trustees for the Education of Youth. The national event was preceded by districts rounds. The segment depicts the event that took place in Kolín nad Labem. The official procession through the streets of Kolín was followed by the District Track and Field Championship at the A.F.K. Stadium. Girls in folk costumes danced to folk songs.
DOESTE v0.5 is a set of developmental corpora of texts written by Brazilian and Portuguese school-age children and adolescents. It is a work in progress.
The texts written by monolingual children and adolescents in European Portuguese were collected between September 2011 and January 2012, from different public schools in Lisbon (Portugal). It is composed of 244 narrative (n=122) and argumentative (n=122) texts. The subjects (51% female and 49% male) are students enroled in the 5th grade (n=52; mean age=10.19), in the 7th grade (n=92; mean age=12.33) and in the 10th grade (n=100; mean age=15.16) from the Portuguese basic schooling. The subcorpus of Portuguese texts is fully tokenized and morphologically annotated, in addition to presenting the sentence occurrences.
The texts written by monolingual children and adolescents in Brazilian Portuguese have been collected since 2017, from different public schools in three cities in Rio Grande do Norte (Brazil). It is currently composed of narrative (n=225) and argumentative (n=225) texts. The subjects (53% female and 47% male) are students enroled in the 5th grade (n=68; mean age=11.13), in the 9th grade (n=82; mean age=15.32) and in the 12th grade (n=224; mean age=17.96) from the Brazilian basic schooling. The subcorpora of Brazilian texts is still in the compilation, but a large part is already searchable, being tokenized and morphologically annotated. The Brazilian subcorpus also presents itself with the original transcripts, along original images.
Portuguese and Brazilian texts were collected from similar tasks:
Narrative-based task: Tell a remarkable story (real or imagined) that you and your best friend lived during the last school vacation.
Argumentative based-task: Do you think social networks (Facebook, Twitter, Google+, Windows Live Space, etc.) are important today? Write a text to be published on your school's blog where you express your opinion on social networks. In this text, you must say whether you are for or against the existence of social networks. Don't forget to justify your opinion!
The next version of DOESTE intends to present semantic annotations and clause and t-unit segmentation.
DOESTE v0.5 is developed and maintained by the Educational Linguistics Research Group (LEd), based at the Federal Rural University of the Semiarid Region (UFERSA).
DOESTE v0.5 by Mário Martins et al. is licensed under CC BY-NC-ND 4.0.
Modifications to DSpace made by Petr Pajas in order to support pidconsortium.eu PID handle system instead of the default handle.com system used by DSpace.
DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset defines a set of features that are encoded by the various tag sets. The set of features should be as universal as possible. It does not need to encode everything that is encoded by any tag set but it should encode all information that people may want to access and/or port from one tag set to another.
New tag sets are attached by writing a driver for them. Once the driver is ready, you can easily convert tags between the new set and any other set for which you also have a driver. This reusability is an obvious advantage over writing a targeted conversion procedure each time you need to convert between a particular pair of tag sets. and grant MSM 0021620838 of the Ministry of Education of the Czech Republic
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 34 captures a speech given by Eddy Sherwood, an American journalist, Protestant missionary and YMCA official, in which he talks about the brave stance adopted by the Czechoslovak nation in the critical days of 1938.
Footage of actor Eduard Kohout admiring actors´ faces engraved in glass on the premises of the Secondary School of Decorative Arts in Prague in a fragmented segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 28.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 38 offers an excerpt from the radio speech delivered by President Edvard Beneš on 10 September 1938, in which he addresses the German minority in Czechoslovakia.
Extremely fast digital audio channelizer implementation, usable as a building block for experimental ASR front-ends or signal denoising applications. Also applicable in software defined radios, due to its high throughput. It comes in a form of a C/C++ library and an executable example program which reads input stream, splitting it into equidistant frequency channels, emitting their data to the output.
Features:
(1) Hand tuned SIMD-aware assembly for x86 (SSE) and IA64 (AVX) as well as for ARM (NEON) processors.
(2) Generic non-SIMD C++ implementation for other architectures.
(3) Capable of taking advantage of multicore CPUs.
(4) Fully configurable number of channels and the output decimation rate.
(5) User supplied FIR of the channel separation filter, which allows to specify the width of the channels, whether they should overlap or be separated.
(6) Input and output signal samples are treated as complex numbers.
(7) Speed over 750 complex MS/s achieved on Core i7 4710HQ @ 2.5GHz, when channelizing into 72 output channels with a FIR length of 1152 samples, using 3 computing threads.
(8) Runs under Linux OS.
Journalist Egon Erwin Kisch in edited footage from post-war newsreel segments of his obituary in Týden ve filmu (Week in Film) 1948, issue no. 15. Kisch with the Mayor of Prague Václav Vacek at an exhibition.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 52 reports on the elections to the Slovak Land Assembly, the highest legislative body of autonomous Slovakia within the framework of the Second Czechoslovak Republic, which were held on 18 December 1938. The footage shows preparations for the distribution of ballots at the Town Hall in Bratislava.
ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.
Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.
This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71097 and 87322 dialogue turns respectively. For Czech meetings, we provide 147 total minutes with 55 of them aligned. For English meetings, it is 256 total minutes with 111 of them aligned.
Please find a more detailed description of the data in the included README and stats.tsv files.
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O.
(2022). ELITR Minuting Corpus: A novel dataset for automatic minuting
from multi-party meetings in English and Czech. In Proceedings of the
13th International Conference on Language Resources and Evaluation
(LREC-2022), Marseille, France, June. European Language Resources
Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022,
author = {Anna Nedoluzhko and Muskaan Singh and Marie
Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for
Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
booktitle = {Proceedings of the 13th International Conference
on Language Resources and Evaluation (LREC-2022)},
year = 2022,
month = {June},
address = {Marseille, France},
publisher = {European Language Resources Association (ELRA)},
note = {In print.}
}
Composer and actor Eman Fiala with his cousin, conductor Jiří Julius Fiala, on Bohumil Veselý's balcony. Eman Fiala in various roles, dancing at a ball, and conducting an orchestra.
A view of the house no. 20 on Karlova Street, Prague. Sculptor Emanuel Kodet working in his studio. A view of the nearby Astronomical Tower at the Clementinum. His brother, sculptor Jan Kodet, working in his studio in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1958, issue no. 35. Jan Kodet with an unidentified man.
Historian Emanuel Svoboda at a gathering to mark the anniversary of the birth of Mikoláš Aleš in a fragmented segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 46. Svoboda on Bohumil Veselý's balcony. Svoboda with his wife Maryna (née Alšová) and an unidentified man on a park bench.
Playwright Emil Artur Longen in Otrávené světlo (The Poisoned Light, dir. Jan S. Kolár, Karel Lamač, 1921) and Rudi na záletech (Rudi The Unfaithful, dir. Emil Artur Longen, Antonín Pech, 1911).
Emil František Burian during a guest performance in Zlín in a segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1939, issue no. 41B. Burian at his desk in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1954, issue no. 25.
Athlete Emil Zátopek wins the 5,000-metre race at the 1952 Summer Olympics in Helsinki in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 37. In 1952, he also wins the 5,000-metre race in Opava in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 42. Zátopek with his wife Dana Zátopková at Strahov Stadium in Prague. Zátopek accepting the Order of the Republic from the hands of Minister of Defence Alexej Čepička in 1952 in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 42.
Eyetracked Multi-Modal Translation (EMMT) is a simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios. It contains monocular eye movement recordings, audio data and 4-electrode wearable electroencephalogram (EEG) data of 43 participants while engaged in sight translation supported by an image.
The details about the experiment and the dataset can be found in the README file.
The corpus presented consists of job ads in Spanish related to Engineering positions in Peru.
The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks.
The corpus is divided in two components:
- POS tagging/ NER training data: Consisting of 800 job ads, each one tokenized and manually annotated with POS tag information (EAGLE format) and Entity Label in BIO format.
- Topic modeling training data: containing 9000 documents stripped from stopwords. Comes in two formats:
* Whole text documents: containing all the information originally posted in the ad.
* Extracted chunks documents: containing chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in Improving Topic Coherence Using Entity Extraction Denoising (to appear)
Professor and Minister of Finance Karel Engliš in a garden in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1934, issue no. 10. Engliš, first on his own and later with literary critic Miloslav Hýsek on Bohumil Veselý's balcony.
Data collection has been done by the means of Sketch Engine program.
Data were extrapolated from the annotated English web corpus enTenTen20.
Data collection and analysis has been done during the period of two months: April and May 2023.
Recently, the enTenTen20 corpus has been updated to a newer version - enTenTen21. Nevertheless, the older version is still available, can be worked on and can be compared with the newer one. It has been noticed that the differences between the two versions of the English web corpus did not affect the results of this study. The only apparent difference was seen in slightly different numbers in frequency values for specific collocations. This was expected since the older version of web corpus consists of 36 billion words, while the new version counts 52 billion words. On the other hand, as noted above, these frequency deviations were not significant enough to refute the hypotheses. They have rather confirmed them once again.
This study is one of the results of work on a larger scientific-research project called "Metaphorical collocations - syntagmatic relations between semantics and pragmatics". More information about the project is available on the following link: https://metakol.uniri.hr/en/opis-projekta/
The study has been financed by the Croatian science foundation.
Working with the data/replicating the study:
Data collected for the purposes of this study is available in CSV format.
Data for each gustatory adjective (collocate) is presented in a separate CSV file.
Upon opening each file, stretch the borders of every column for better visibility of data.
Tables show different collocational bases (nouns) which are found in the corpus, in combination with a specific gustatory adjective, their collocate.
These nouns are listed by their score number (The Mutual Information score expresses the extent to which words co-occur compared to the number of times they appear separately).
Tables show what type of mapping is present in a certain collocation (e.g., intra-modal or cross-modal).
Tables show what type of meaning or cognitive process is working in the background of the meaning formation (e.g., metonymic or metaphoric).
For every analyzed collocation, we provided a contextualized example of its use from the corpus, along with the hyperlink where it can be found.
English model for NameTag, a named entity recognition tool. The model is trained on CoNLL-2003 training data. Recognizes PER, ORG, LOC and MISC named entities. Achieves F-measure 84.73 on CoNLL-2003 test data.
English models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from Morphium and SCOWL (Spell Checker Oriented Word Lists), the PoS tagger is trained on WSJ (Wall Street Journal). and This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The morphological POS analyzer development was supported by grant of the Ministry of Education, Youth and Sports of the Czech Republic No. LC536 "Center for Computational Linguistics". The morphological POS analyzer research was performed by Johanka Spoustová (Spoustová 2008; the Treex::Tool::EnglishMorpho::Analysis Perl module). The lemmatizer was implemented by Martin Popel (Popel 2009; the Treex::Tool::EnglishMorpho::Lemmatizer Perl module). The lemmatizer is based on morpha, which was released under LGPL licence as a part of RASP system (http://ilexir.co.uk/applications/rasp).
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The corpus contains recordings of male speaker, native in Czech, talking in English. The sentences that were read by the speaker originate in the domain of air traffic control (ATC), specifically the messages used by plane pilots during routine flight. The text in the corpus originates from the transcripts of the real recordings, part of which has been released in LINDAT/CLARIN (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0), and individual phrases were selected by special algorithm described in Jůzová, M. and Tihelka, D.: Minimum Text Corpus Selection for Limited Domain Speech Synthesis (DOI 10.1007/978-3-319-10816-2_48). The corpus was used to create a limited domain speech synthesis system capable of simulating a pilot communication with an ATC officer.
The corpus contains recordings of male speaker, native in German, talking in English. The sentences that were read by the speaker originate in the domain of air traffic control (ATC), specifically the messages used by plane pilots during routine flight. The text in the corpus originates from the transcripts of the real recordings, part of which has been released in LINDAT/CLARIN (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0), and individual phrases were selected by special algorithm described in Jůzová, M. and Tihelka, D.: Minimum Text Corpus Selection for Limited Domain Speech Synthesis (DOI 10.1007/978-3-319-10816-2_48). The corpus was used to create a limited domain speech synthesis system capable of simulating a pilot communication with an ATC officer.
The corpus contains recordings of male speaker, native in Serbian, talking in English. The sentences that were read by the speaker originate in the domain of air traffic control (ATC), specifically the messages used by plane pilots during routine flight. The text in the corpus originates from the transcripts of the real recordings, part of which has been released in LINDAT/CLARIN (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0), and individual phrases were selected by special algorithm described in Jůzová, M. and Tihelka, D.: Minimum Text Corpus Selection for Limited Domain Speech Synthesis (DOI 10.1007/978-3-319-10816-2_48). The corpus was used to create a limited domain speech synthesis system capable of simulating a pilot communication with an ATC officer.
The corpus contains recordings of male speaker, native in Taiwanese, talking in English. The sentences that were read by the speaker originate in the domain of air traffic control (ATC), specifically the messages used by plane pilots during routine flight. The text in the corpus originates from the transcripts of the real recordings, part of which has been released in LINDAT/CLARIN (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0), and individual phrases were selected by special algorithm described in Jůzová, M. and Tihelka, D.: Minimum Text Corpus Selection for Limited Domain Speech Synthesis (DOI 10.1007/978-3-319-10816-2_48). The corpus was used to create a limited domain speech synthesis system capable of simulating a pilot communication with an ATC officer.
Sentence-parallel corpus made from English and Czech Wikipedias based on translated articles from English into Czech.
The work done is described in the paper: ŠTROMAJEROVÁ, Adéla, Vít BAISA a Marek BLAHUŠ. Between Comparable and Parallel: English-Czech Corpus from Wikipedia. In RASLAN 2016 Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016. s. 3-8, 6 s. ISBN 978-80-263-1095-2.
English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the Emille parallel corpus. and FP7-ICT-2007-3-231720 (EuroMatrix Plus) 7E09003 (Czech part of EM+)
English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and downloaded website of European Commission [5]. Corpus is published in both in plaintext format and with an automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/ and This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu language with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of crawled data include but are not limited to the following:
1- Manually corrected sentence alignment of the corpora.
2- Our data split (training-development-test) so that our published experiments can be reproduced.
3- Tokenization (optional, but needed to reproduce our experiments).
4- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex contains links also to PropBank and Verbnet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. The EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
EngVallex 2.0 as a slightly updated version of EngVallex. It is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex contains links also to PropBank (English predicate-argument lexicon). The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank(s), which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. The EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT. EngVallex 2.0 is the same dataset as the EngVallex lexicon packaged with the PCEDT 3.0 corpus, but published separately under a more permissive licence, avoiding the need for LDC licence which is tied to PCEDT 3.0 as a whole.
Enriched discourse annotation of a subset of the Prague Discourse Treebank, adding implicit relations, entity based relations, question-answer relations and other discourse structuring phenomena.