En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip).
Their main use should be in-domain translation of social surveys.
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on MCSQ test set (BLEU):
en->ru: 64.3 (train: genuine in-domain MCSQ data)
ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data)
(Evaluated using multeval: https://github.com/jhclark/multeval)
Stanislava Fedrová, Jan Hejk, Alice Jedličková (edd.)., Obsahuje bibliografie a bibliografické odkazy, and Část. slovenský text, anglická, německé a ruské resumé
The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments from videos published between 2008 to 2018 targeted at a young, German-speaking demographic and represents an authentic language snapshot of young German speakers. The corpus was proportionally sampled based on video category and year from a database of 112 popular German-speaking YouTube channels in the DACH region for optimal representativeness and balance and contains a considerable amount of associated metadata for each comment that enable further longitudinal cross-sectional analyses.