Search Results
382. Croatian National Corpus
- Publisher:
- University of Zagreb, Faculty of Humanities and Social Sciences
- Type:
- corpus
- Language:
- Croatian
- Description:
- This is the reference corpus of standard Croatian. Version 3.0, which is accessible via NoSketch Engine, contains 216.8 million tokens. The corpus is tokenised, lemmatised and tagged with MSDs (morphosyntactic descriptions).
- Rights:
- Not specified
383. Croatian-English Parallel Corpus
- Publisher:
- University of Zagreb, Faculty of Humanities and Social Sciences
- Type:
- corpus
- Language:
- Croatian and English
- Description:
- written; domain-specific (newspaper); synchronic; bilingual; parallel; unidirectional; XML; S-alignment
- Rights:
- Not specified
384. CsEnVi Pairwise Parallel Corpora
- Creator:
- Hoang, Duc Tam and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, Vietnamese, parallel corpus, Czech-Vietnamese corpus, and English-Vietnamese corpus
- Language:
- Czech, English, and Vietnamese
- Description:
- CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
  - OPUS, the open parallel corpus, is a growing multilingual corpus of translated open source documents. The majority of the Vi-En and Vi-Cs bitexts are subtitles from movies and television series. The two sides of these bitexts tend to paraphrase each other's meaning rather than translate it.
  - TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated into other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.
  The size of the original corpora collected from OPUS and TED talks is as follows:
                   CS/VI               EN/VI
      Sentence     1337199/1337199     2035624/2035624
      Word         9128897/12073975    16638364/17565580
      Unique word  224416/68237        91905/78333
  We improve the quality of the corpora in two steps: normalizing and filtering. In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames, and these sequences are distributed differently on the source and the target side. Removing the sequences of dots, along with applying a number of other normalization rules, improves the quality of the alignment significantly. In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
  The size of the cleaned corpora as published is as follows:
                   CS/VI               EN/VI
      Sentence     1091058/1091058     1113177/1091058
      Word         6718184/7646701     8518711/8140876
      Unique word  195446/59737        69513/58286
  The corpora are used as training data in [2].
  References:
  [1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
  [2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
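The normalizing step described for the CsEnVi corpora mentions removing sequences of dots that mark subtitle continuation. A minimal sketch of what such a rule might look like is given below. The function names and the exact pattern (two or more dots, or a single ellipsis character) are assumptions made for illustration; the record does not specify the authors' actual rules.

```python
import re
from typing import Tuple

# Assumed pattern: a run of two or more dots, or one ellipsis character,
# together with any whitespace around it. A single sentence-final period
# is deliberately left alone.
DOT_RUN = re.compile(r"\s*(?:\.{2,}|…)\s*")


def strip_dot_runs(line: str) -> str:
    """Replace each dot sequence with a space, then collapse whitespace."""
    return " ".join(DOT_RUN.sub(" ", line).split())


def normalize_pair(source: str, target: str) -> Tuple[str, str]:
    """Apply the same dot-removal rule to both sides of a sentence pair."""
    return strip_dot_runs(source), strip_dot_runs(target)


if __name__ == "__main__":
    print(normalize_pair("... and then he left...", "Wait.. what"))
```

Applying the same rule to both sides matters here because, as the description notes, the dot sequences are distributed differently in the source and the target, which can otherwise disturb alignment.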
385. CST's lemmatiser
- Publisher:
- Center for Sprogteknologi, University of Copenhagen
- Type:
- toolService
- Language:
- Danish, Dutch, English, German, Modern Greek (1453-), Icelandic, Norwegian, Russian, Slovenian, and Swedish
- Description:
- 1) Fully automatic rule-based lemmatization of inflected languages 2) Fully automatic training of lemmatization rules based on a list of full forms and their lemmas
- Rights:
- Not specified
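To illustrate the general idea of training lemmatization rules from a list of full forms and lemmas, here is a toy sketch that learns suffix-replacement rules. It is not CST's algorithm; the function names, the suffix-only rule format, and the longest-suffix matching strategy are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def suffix_rule(form: str, lemma: str) -> Tuple[str, str]:
    """Strip the longest common prefix and return (form_suffix, lemma_suffix)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def train(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Learn one replacement per form suffix: the most frequent one observed."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        old, new = suffix_rule(form, lemma)
        counts[old][new] += 1
    return {old: c.most_common(1)[0][0] for old, c in counts.items()}


def lemmatize(word: str, rules: Dict[str, str]) -> str:
    """Apply the rule for the longest matching suffix; otherwise keep the word."""
    for start in range(len(word)):
        suffix = word[start:]
        if suffix in rules:
            return word[:start] + rules[suffix]
    # Fall back to the empty-suffix rule if one was learned.
    return word + rules.get("", "")


if __name__ == "__main__":
    rules = train([("walked", "walk"), ("talked", "talk"),
                   ("cats", "cat"), ("dogs", "dog")])
    print(lemmatize("jumped", rules), lemmatize("birds", rules))
```

A real system would need richer rules (prefixes, infixes, context) and a way to resolve conflicts between rules; this sketch only shows how rules can be induced from the pair list without manual authoring.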
386. CST's lemmatizer
- Creator:
- Jongejan, Bart
- Publisher:
- Københavns Universitet, Center for Sprogteknologi (CST)
- Type:
- toolService
- Description:
- 1) Fully automatic rule-based lemmatization of inflected languages 2) Fully automatic training of lemmatization rules based on a list of full forms and their lemmas
- Rights:
- Not specified
387. Ctibor Blatný (phytopathologist)
- Creator:
- Veselý, Bohumil
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, Places::Praha::Dejvice::Hanspaulka::vila Oldřicha Blažíčka /ext.,int./, and People::Blatný Ctibor (1897-1978)
- Language:
- No linguistic content
- Description:
- Professor and phytopathologist Ctibor Blatný on Bohumil Veselý's balcony.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
388. CUBBITT Translation Models (en-cs) (v1.0)
- Creator:
- Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- machine translation, neural machine translation, transformer, and cubbitt
- Language:
- English and Czech
- Description:
- CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2014 (BLEU): en->cs: 27.6 cs->en: 34.4 (Evaluated using multeval: https://github.com/jhclark/multeval)
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
389. CUBBITT Translation Models (en-fr) (v1.0)
- Creator:
- Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- machine translation, neural machine translation, transformer, and cubbitt
- Language:
- English and French
- Description:
- CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2014 (BLEU): en->fr: 38.2 fr->en: 36.7 (Evaluated using multeval: https://github.com/jhclark/multeval)
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
390. CUBBITT Translation Models (en-pl) (v1.0)
- Creator:
- Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- machine translation, neural machine translation, transformer, and cubbitt
- Language:
- English and Polish
- Description:
- CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->pl: 12.3 pl->en: 20.0 (Evaluated using multeval: https://github.com/jhclark/multeval)
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB