Search Results
22. FAUST 0.5
- Creator:
- Hajič, Jan, Mareček, David, Fučíková, Eva, Cinková, Silvie, Štěpánek, Jan, and Mikulová, Marie
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- tectogrammatics, treebank, parallel corpus, and noisy texts
- Language:
- English and Czech
- Description:
- Syntactic (including deep-syntactic, i.e. tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test sets. The English data includes manual annotations of English reference translations of Czech source texts. These texts were translated independently by two translators. After some necessary cleaning, 1000 segments were randomly selected for manual annotation. Both reference translations were annotated, which means 2000 annotated segments in total. The Czech data includes manual annotations of Czech reference translations of English source texts. These texts were translated independently by three translators. After some necessary cleaning, 1000 segments were randomly selected for manual annotation. All three reference translations were annotated, which means 3000 annotated segments in total. Faust is part of PDT-C 1.0 (http://hdl.handle.net/11234/1-3185).
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
23. Khresmoi Query Translation Test Data 2.0
- Creator:
- Pecina, Pavel, Dušek, Ondřej, Hajič, Jan, Libovický, Jindřich, and Urešová, Zdeňka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, test data, medical, health, machine translation, Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
- Language:
- Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
- Description:
- This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish and Swedish. The queries come from the general public and medical experts. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
24. Khresmoi Summary Translation Test Data 2.0
- Creator:
- Dušek, Ondřej, Hajič, Jan, Hlaváčová, Jaroslava, Libovický, Jindřich, Pecina, Pavel, Tamchyna, Aleš, and Urešová, Zdeňka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, test data, medical, health, machine translation, Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
- Language:
- Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
- Description:
- This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
25. Manual Re-evaluation of Translation Quality of WMT 2018 English-Czech systems
- Creator:
- Popel, Martin, Tomková, Markéta, and Tomek, Jakub
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, manual evaluation, fluency, adequacy, and Translation Turing test
- Language:
- Czech and English
- Description:
- This data set contains four types of manual annotation of translation quality, focusing on the comparison of human and machine translation quality (aka human-parity). The machine translation system used is English-Czech CUNI Transformer (CUBBITT). The annotations distinguish adequacy, fluency and overall quality. One of the types is Translation Turing test - detecting whether the annotators can distinguish human from machine translation. All the sentences are taken from the English-Czech test set newstest2018 (WMT2018 News translation shared task www.statmt.org/wmt18/translation-task.html), but only from the half with originally English sentences translated to Czech by a professional agency.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
26. MorfFlex CZ 2.0
- Creator:
- Hajič, Jan, Hlaváčová, Jaroslava, Mikulová, Marie, Straka, Milan, and Štěpánková, Barbora
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicalConceptualResource, and computationalLexicon
- Subject:
- morphological dictionary, morphology, and Czech
- Language:
- Czech
- Description:
- MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of lemma-tag-wordform triples. For each wordform, full inflectional information is coded in a positional tag. Wordforms are organized into entries (paradigm instances or paradigms in short) according to their formal morphological behavior. The paradigm (set of wordforms) is identified by a unique lemma. Apart from traditional morphological categories, the description also contains some semantic, stylistic and derivational information. For more details see a comprehensive specification of the Czech morphological annotation http://ufal.mff.cuni.cz/techrep/tr64.pdf .
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
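The lemma-tag-wordform layout described for MorfFlex CZ 2.0 can be illustrated with a short reader. This is only a minimal sketch: it assumes one triple per line with tab-separated columns in the order lemma, tag, wordform, and it assumes that the first position of the positional tag encodes the part of speech. The function names and the sample lines are hypothetical and are not taken from the dictionary itself; check the distributed files and the specification linked above before relying on it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_triple(line: str) -> Tuple[str, str, str]:
    """Split one line into (lemma, tag, wordform).

    Assumption: columns are tab-separated and appear in this order.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    lemma, tag, form = fields
    return lemma, tag, form


def build_paradigms(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (wordform, tag) pairs under the lemma that identifies the paradigm."""
    paradigms: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        lemma, tag, form = parse_triple(line)
        paradigms[lemma].append((form, tag))
    return dict(paradigms)


def tag_position(tag: str, position: int) -> str:
    """Return the character at a 1-based position of a positional tag."""
    return tag[position - 1]


if __name__ == "__main__":
    # Illustrative lines only, not copied from MorfFlex.
    sample = [
        "hrad\tNNIS1-----A----\thrad\n",
        "hrad\tNNIS2-----A----\thradu\n",
    ]
    for lemma, forms in build_paradigms(sample).items():
        for form, tag in forms:
            # Assumption: position 1 of the tag is the part of speech.
            print(lemma, form, tag, tag_position(tag, 1))
```

Grouping by lemma mirrors the statement that a paradigm is identified by a unique lemma; a real loader would probably stream the file rather than hold every entry in memory.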
27. MorfFlex SK 170914
- Creator:
- Hajič, Jan and Hric, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, computationalLexicon, and lexicalConceptualResource
- Subject:
- Slovak and morphological dictionary
- Language:
- Slovak
- Description:
- Slovak morphological dictionary modeled after the Czech one. It consists of (word form, lemma, POS tag) triples, reusing the Czech morphological system for POS tags and lemma descriptions.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
28. NameTag 2 Models (2020-08-31)
- Creator:
- Straková, Jana and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, mlmodel, and languageDescription
- Subject:
- named entity recognition
- Language:
- English, German, Dutch, Spanish, and Czech
- Description:
- NER models for NameTag 2, a named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation, including performance, can be found at https://ufal.mff.cuni.cz/nametag/2/models . NameTag 2 itself can be found at https://ufal.mff.cuni.cz/nametag/2 .
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
29. NomVallex I.
- Creator:
- Kolářová, Veronika, Vernerová, Anna, and Klímová, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, machineReadableDictionary, and lexicalConceptualResource
- Subject:
- valency, Czech, lexicon, syntax, semantics, nominal valency, and deverbal nouns
- Language:
- Czech
- Description:
- The NomVallex I. lexicon describes valency of Czech deverbal nouns belonging to three semantic classes, i.e. Communication (dotaz 'question'), Mental Action (plán 'plan') and Psych State (nenávist 'hatred'). It covers both stem-nominals and root-nominals (dotazování se 'asking' and dotaz 'question'). In total, the lexicon includes 505 lexical units in 248 lexemes. Valency properties are captured in the form of valency frames, specifying valency slots and their morphemic forms, and are exemplified by corpus examples. In order to facilitate comparison, this submission also contains abbreviated entries of the source verbs of these nouns from the Vallex lexicon and simplified entries of the covered nouns from the PDT-Vallex lexicon.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
30. OdiEnCorp 1.0
- Creator:
- Parida, Shantipriya and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- Odia English Parallel Corpus, Odia Monolingual Corpus, and English-Odia Machine Translation
- Language:
- Oriya (macrolanguage) and English
- Description:
- Data
  ----
  We have collected English-Odia parallel and monolingual data from publicly available websites for NLP research in Odia.

  The parallel corpus consists of an English-Odia parallel Bible, Odia digital library, and Odisha Government websites. It covers the Bible, literature, and the Government of Odisha and its policies. We have processed the raw data collected from the websites, performed alignments (a mix of manual and automatic alignments) and release the corpus in a form ready for various NLP tasks.

  The Odia monolingual data consists of Odia-Wikipedia and Odia e-magazine websites. Because the major portion of the data is extracted from Odia-Wikipedia, it covers all kinds of domains. The e-magazine data mostly cover the literature domain. We have preprocessed the monolingual data, including de-duplication, text normalization, and sentence segmentation, to make it ready for various NLP tasks.

  Corpus Formats
  --------------
  Both corpora are in simple tab-delimited plain text files.

  The parallel corpus files have three columns:
  - the original book/source of the sentence pair
  - the English sentence
  - the corresponding Odia sentence

  The monolingual corpus has a varying number of columns:
  - each line corresponds to one *paragraph* (or related unit) of the original source
  - each tab-delimited unit corresponds to one *sentence* in the paragraph

  Data Statistics
  ---------------
  The statistics of the current release are given below.

  Parallel Corpus Statistics
  --------------------------
  Dataset  Sentences  #English tokens  #Odia tokens
  -------  ---------  ---------------  ------------
  Train        27136           706567        604147
  Dev            948            21912         19513
  Test          1262            28488         24365
  -------  ---------  ---------------  ------------
  Total        29346           756967        648025

  Domain Level Statistics
  -----------------------
  Domain               Sentences  #English tokens  #Odia tokens
  -------------------  ---------  ---------------  ------------
  Bible                    29069           756861        640157
  Literature                 424             7977          6611
  Government policies        204             1411          1257
  -------------------  ---------  ---------------  ------------
  Total                    29697           766249        648025

  Monolingual Corpus Statistics
  -----------------------------
  Paragraphs  Sentences  #Odia tokens
  ----------  ---------  ------------
       71698     221546       2641308

  Domain Level Statistics
  -----------------------
  Domain          Paragraphs      Sentences  #Odia tokens
  --------------  --------------  ---------  ------------
  General (wiki)  30468 (42.49%)     102085       1320367
  Literature      41230 (57.50%)     119461       1320941
  --------------  --------------  ---------  ------------
  Total           71698              221546       2641308

  Citation
  --------
  If you use this corpus, please cite it directly (see above), but please also cite the following paper:
  Title: OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation
  Author: Shantipriya Parida, Ondrej Bojar, and Satya Ranjan Dash
  Proceedings of the Third International Conference on Smart Computing & Informatics (SCI) 2018
  Series: Smart Innovation, Systems and Technologies (SIST)
  Publisher: Springer Singapore
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
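The two tab-delimited layouts described under Corpus Formats for OdiEnCorp 1.0 could be read roughly as follows. This is a minimal sketch under stated assumptions: the files are UTF-8 plain text, the three parallel columns appear in the order listed (source, English, Odia), and no field contains an embedded tab. The type and function names are hypothetical, and the sample strings are placeholders rather than corpus data.

```python
from typing import List, NamedTuple


class ParallelPair(NamedTuple):
    """One line of the parallel corpus (assumed column order)."""
    source: str   # original book/source of the sentence pair
    english: str  # the English sentence
    odia: str     # the corresponding Odia sentence


def parse_parallel_line(line: str) -> ParallelPair:
    """Parse a three-column, tab-delimited parallel line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 columns, got {len(fields)}")
    return ParallelPair(*fields)


def parse_monolingual_line(line: str) -> List[str]:
    """Parse one paragraph line into its sentences.

    Each line is one paragraph; each tab-delimited unit is one sentence,
    so the number of columns varies from line to line.
    """
    return [unit for unit in line.rstrip("\n").split("\t") if unit]


if __name__ == "__main__":
    # Placeholder content only, not taken from the corpus.
    pair = parse_parallel_line("Bible\t<English sentence>\t<Odia sentence>\n")
    print(pair.source, "|", pair.english, "|", pair.odia)

    sentences = parse_monolingual_line("<sentence 1>\t<sentence 2>\t<sentence 3>\n")
    print(len(sentences), "sentences in this paragraph")
```

Splitting on tabs by hand, rather than using a CSV reader with quoting enabled, avoids misreading quotation marks inside sentences as field delimiters; whether that matters for these files is an assumption worth checking against the data.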