The `corpipe23-corefud1.1-231206` is a `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released under the CC BY-NC-SA 4.0 license.
The model is language agnostic (no _corpus id_ on input), so it can be used to predict coreference in any `mT5` language (for zero-shot evaluation, see the paper). Note, however, that empty nodes must already be present in the input; they are not predicted (the same setting as in the CRAC23 shared task).
A corpus of texts in 12 languages. For each language, we provide one training, one development, and one test set acquired from Wikipedia articles. Moreover, each language dataset contains a (substantially larger) training set collected from (general) Web texts. All sets are disjoint, except for the Wikipedia and Web training sets, which can contain similar sentences. The data are segmented into sentences, which are further tokenized into words.
All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes. We generally recommend the method called uninames, which behaves better for some languages.
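To illustrate the general idea behind a Unicode-name-based approach, here is a minimal sketch. It is not the attached diacritization_stripping.py, and the function name `strip_diacritics_by_name` is hypothetical. The sketch assumes that a character's base letter can be recovered by cutting its Unicode name at the first " WITH " and looking up the remaining name.

```python
import unicodedata


def strip_diacritics_by_name(text: str) -> str:
    """Replace each character by its base character, derived from its Unicode name.

    Hypothetical helper for illustration only; the attached script may differ.
    """
    result = []
    for ch in text:
        # Characters without a Unicode name (e.g. control characters) yield "".
        name = unicodedata.name(ch, "")
        # "LATIN SMALL LETTER R WITH CARON" -> base "LATIN SMALL LETTER R"
        base, sep, _ = name.partition(" WITH ")
        if sep:
            try:
                ch = unicodedata.lookup(base)
            except KeyError:
                # No character carries the shortened name; keep the original.
                pass
        result.append(ch)
    return "".join(result)


print(strip_diacritics_by_name("Přeložil Jaroslav Zaorálek"))
```

Because this variant works on character names rather than on Unicode decomposition, it also maps letters such as "Ł" (LATIN CAPITAL LETTER L WITH STROKE) to "L", even though they have no canonical decomposition.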
The code for training a recurrent-neural-network-based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
CUBBITT En-Fr translation models, exported for TensorFlow Serving and available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
The models are compatible with Tensor2Tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->fr: 38.2
fr->en: 36.7
(Evaluated using multeval: https://github.com/jhclark/multeval)
Maurice Maeterlinck., Translated from French, "Translated by Jaroslav Zaorálek"--Colophon, In addition to the edition on "Antik" paper, 300 numbered copies were issued on "Japan Banzay" paper, of which copies 1-25 are signed by M. Maeterlinck, and Numbered copy 61
introduction by Karel Čapek ; translated by Jos. Hrdinová., Includes bound-with items : Rudolf Kremlička : trente six reproduction -- Václav Špála : siebenunddreisig Reproduktionen -- František Kupka : třicet dvě reprodukce, and Back cover not available, front cover from MVS.
Josef Páta., KČSN., With a reprint of 10 letters and of the Lusatian supplements to the Comparative Russian Dictionary of Catherine II., Includes bibliographical references and index., and Text partly in German, summary in French