The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/).
CzEngVallex is a bilingual valency lexicon of corresponding Czech and English verbs. It connects 20835 aligned valency frame pairs (verb senses) which are translations of each other, aligning their arguments as well. The CzEngVallex serves as a powerful, real-text-based database of frame-to-frame and subsequently argument-to-argument pairs and can be used for example for machine translation applications. It uses the data from the Prague Czech-English Dependency Treebank project (PCEDT 2.0, http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4) and it also takes advantage of two existing valency lexicons: PDT-Vallex for Czech and EngVallex for English, using the same view of valency (based on the Functional Generative Description theory). The CzEngVallex is available in an XML format in the LINDAT/CLARIN repository, and also in a searchable form (see the “More Apps” tab) interlinked with PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F),EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2) and with examples from the PCEDT.
ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.
Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.
This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71097 and 87322 dialogue turns respectively. For Czech meetings, we provide 147 total minutes with 55 of them aligned. For English meetings, it is 256 total minutes with 111 of them aligned.
Please find a more detailed description of the data in the included README and stats.tsv files.
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O.
(2022). ELITR Minuting Corpus: A novel dataset for automatic minuting
from multi-party meetings in English and Czech. In Proceedings of the
13th International Conference on Language Resources and Evaluation
(LREC-2022), Marseille, France, June. European Language Resources
Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022,
author = {Anna Nedoluzhko and Muskaan Singh and Marie
Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for
Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
booktitle = {Proceedings of the 13th International Conference
on Language Resources and Evaluation (LREC-2022)},
year = 2022,
month = {June},
address = {Marseille, France},
publisher = {European Language Resources Association (ELRA)},
note = {In print.}
}
English model for NameTag, a named entity recognition tool. The model is trained on CoNLL-2003 training data. Recognizes PER, ORG, LOC and MISC named entities. Achieves F-measure 84.73 on CoNLL-2003 test data.
English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu language with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of crawled data include but are not limited to the following:
1- Manually corrected sentence alignment of the corpora.
2- Our data split (training-development-test) so that our published experiments can be reproduced.
3- Tokenization (optional, but needed to reproduce our experiments).
4- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex contains links also to PropBank and Verbnet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. The EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
EngVallex 2.0 as a slightly updated version of EngVallex. It is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex contains links also to PropBank (English predicate-argument lexicon). The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank(s), which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. The EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT. EngVallex 2.0 is the same dataset as the EngVallex lexicon packaged with the PCEDT 3.0 corpus, but published separately under a more permissive licence, avoiding the need for LDC licence which is tied to PCEDT 3.0 as a whole.
Enriched discourse annotation of a subset of the Prague Discourse Treebank, adding implicit relations, entity based relations, question-answer relations and other discourse structuring phenomena.
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains source English videos and audios. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the web of European Parliament, where they are publicly avaiable.
The transcripts are equipped with metadata (disfluencies, mixing voices and languages, read or spontaneous speech, etc.), punctuated, and with word-level timestamps.
The speeches in the corpus come from the European Parliament plenary sessions, from the period 2008-11. Most of the speakers are MEP, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
The current version of ESIC is v1.0. It has validation and evaluation parts.
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains source English videos and audios. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the web of European Parliament, where they are publicly avaiable.
The transcripts are equipped with metadata (disfluencies, mixing voices and languages, read or spontaneous speech, etc.), punctuated, and with word-level timestamps.
The speeches in the corpus come from the European Parliament plenary sessions, from the period 2008-11. Most of the speakers are MEP, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
ESIC has validation and evaluation parts.
The current version is ESIC v1.1, it extends v1.0 with manual sentence alignment of the tri-parallel texts, and with bi-parallel sentence alignment of English original transcripts and German interpreting.