A large web corpus (over 10 billion tokens) in more than 50 languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general web crawl to date, with about 2 billion crawled URLs.
This corpus was originally created for performance testing of the CorpusExplorer server infrastructure (see diskurslinguistik.net / diskursmonitor.de). It contains the German-only filtered subset of CommonCrawl (as of March 2018). First, the URLs were filtered according to their top-level domain (de, at, ch). Then the texts were classified using NTextCat, and only texts classified unambiguously as German were included in the corpus. The texts were then annotated using TreeTagger (token, lemma, part-of-speech). The corpus comprises 2.58 million documents, 232.87 million sentences, and 3.021 billion tokens. You can use CorpusExplorer (http://hdl.handle.net/11234/1-2634) to convert this data into various other corpus formats (XML, JSON, Weblicht, TXM and many more).
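The first filtering step described above (keeping only URLs whose top-level domain is de, at, or ch) can be illustrated with a minimal Python sketch. This is not the code used to build the corpus; the function names `has_german_tld` and `filter_urls` are hypothetical, and the sketch assumes each URL carries a scheme such as `https://` so that the host can be parsed.

```python
from urllib.parse import urlsplit

# Top-level domains named in the description (Germany, Austria, Switzerland).
GERMAN_TLDS = {"de", "at", "ch"}


def has_german_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the listed TLDs.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(url).hostname  # lower-cased host without port, or None
    if not host:
        return False
    # Take the last dot-separated label of the host as the TLD.
    return host.rstrip(".").rsplit(".", 1)[-1] in GERMAN_TLDS


def filter_urls(urls):
    """Keep only URLs that pass the TLD check, preserving input order."""
    return [u for u in urls if has_german_tld(u)]
```

The check looks only at the host, so a path such as `/page.de` on a `.com` site does not pass. Language identification and annotation (NTextCat, TreeTagger) are separate later steps and are not covered by this sketch.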
`corpipe23-corefud1.1-231206` is a `mT5-large`-based multilingual model for coreference resolution, usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released under the CC BY-NC-SA 4.0 license.
The model is language agnostic (no _corpus id_ on input), so it can be used to predict coreference in any `mT5` language (for zero-shot evaluation, see the paper). Note, however, that empty nodes must already be present in the input; they are not predicted (the same setting as in the CRAC23 shared task).
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains the source English video and audio. The interpreters' voices are not published within the corpus, but a tool is provided that downloads them from the European Parliament website, where they are publicly available.
The transcripts are punctuated and come with metadata (disfluencies, mixing of voices and languages, read or spontaneous speech, etc.) and word-level timestamps.
The speeches in the corpus come from the European Parliament plenary sessions held in 2008-2011. Most of the speakers are Members of the European Parliament (MEPs), both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
The current version of ESIC is v1.0. It has validation and evaluation parts.
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains the source English video and audio. The interpreters' voices are not published within the corpus, but a tool is provided that downloads them from the European Parliament website, where they are publicly available.
The transcripts are punctuated and come with metadata (disfluencies, mixing of voices and languages, read or spontaneous speech, etc.) and word-level timestamps.
The speeches in the corpus come from the European Parliament plenary sessions held in 2008-2011. Most of the speakers are Members of the European Parliament (MEPs), both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
ESIC has validation and evaluation parts.
The current version is ESIC v1.1. It extends v1.0 with manual sentence alignment of the tri-parallel texts and with bi-parallel sentence alignment of the original English transcripts and the German interpreting.
The representative, digitized full-text HetWiK corpus is composed of 140 manually annotated texts of the German Resistance between 1933 and 1945. It includes both well-known and relatively unknown documents: public writings, such as pamphlets or memoranda, as well as private texts, e.g. letters, journal or prison entries and biographies. The corpus thus represents the diverse groups as well as the heterogeneity of verbal resistance and allows the study of resistance in relation to language use.
The HetWiK corpus can be used free of charge. A detailed register of the individual texts and further information about the tagset can be found on the project homepage (in German). In addition to the CATMA5 XML format, we provide a standoff JSON format and CEC6 files (CorpusExplorer), so you can export the HetWiK corpus in different formats.
GeCzLex 1.0 is an online electronic resource for translation equivalents of Czech and German discourse connectives. It contains anaphoric connectives for both languages and their possible translations documented in bilingual parallel corpora (not necessarily anaphoric). The entries have been interlinked via semantic annotation of the connectives (taken from the monolingual lexicons of connectives CzeDLex and DiMLex) according to the PDTB 3 sense taxonomy, and via translation possibilities acquired from the Czech and German parallel data of the InterCorp project. The lexicon is the first bilingual inventory of connectives with linkage on the level of individual pairs (connective + discourse sense).
En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip).
They are intended mainly for in-domain translation of social surveys.
The models are compatible with Tensor2Tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on MCSQ test set (BLEU):
en->de: 67.5 (train: genuine in-domain MCSQ data only)
de->en: 75.0 (train: additional in-domain backtranslated MCSQ data)
(Evaluated using multeval: https://github.com/jhclark/multeval)
NER models for NameTag 2, a named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation, including performance, can be found at https://ufal.mff.cuni.cz/nametag/2/models . NameTag 2 itself is available at https://ufal.mff.cuni.cz/nametag/2 .