Number of results to display per page
Search Results
612. Manual Arabic spelling-errors correction for collected documents
- Creator:
- Saty, Ahmed, Aouragh, Si Lhoussain, and Bouzoubaa, Karim
- Publisher:
- Sudan University of Science and Technology
- Type:
- text and corpus
- Subject:
- Manual Arabic spelling-errors correction for collected documents
- Language:
- English and Arabic
- Description:
- The file represents a text corpus in the context of Arabic spell checking, where a group of persons edited different files, and all of the committed spelling errors by these persons have been recorded. A comprehensive representation these persons’ profile has been considered: male, female, old-aged, middle-aged, young-aged, high and low computer usage users, etc. Through this work, we aim to help researchers and those interested in Arabic NLP by providing them with an Arabic spell check corpus ready and open to exploitation and interpretation. This study also enabled the inventory of most spelling mistakes made by editors of Arabic texts. This file contains the following sections (tags): people – documents they printed – types of possible errors – errors they made. Each section (tag) contains some data that explains its details and its content, which helps researchers extracting research-oriented results. The people section contains basic information about each person and its relationship of using the computer, while the documents section clarifies all sentences in each document with the numbering of each sentence to be used in the errors section that was committed. We are also adding the “type of errors” section in which we list all the possible errors with their description in the Arabic language and give an illustrative example.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
613. Manual Re-evaluation of Translation Quality of WMT 2018 English-Czech systems
- Creator:
- Popel, Martin, Tomková, Markéta, and Tomek, Jakub
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, manual evaluation, fluency, adequacy, and Translation Turing test
- Language:
- Czech and English
- Description:
- This data set contains four types of manual annotation of translation quality, focusing on the comparison of human and machine translation quality (aka human-parity). The machine translation system used is English-Czech CUNI Transformer (CUBBITT). The annotations distinguish adequacy, fluency and overall quality. One of the types is Translation Turing test - detecting whether the annotators can distinguish human from machine translation. All the sentences are taken from the English-Czech test set newstest2018 (WMT2018 News translation shared task www.statmt.org/wmt18/translation-task.html), but only from the half with originally English sentences translated to Czech by a professional agency.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
614. Manually Classified Errors in Cs->Sk Translation
- Creator:
- Galuščáková, Petra and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, errors classification, and CS-SK translation
- Language:
- Slovak and Czech
- Description:
- Manual classification of errors of Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. First 50 sentences from WMT 2010 test set were translated by 5 MT systems (Česílko, Česílko2, Google Translate and two Moses setups) and MT errors were manually marked and classified. Classification was applied in MT systems comparison [3]. Reference translation is included. References: [1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006. [2] http://matrix.statmt.org/test_sets/list [3] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011 and This work has been supported by the grants Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
615. Manually Classified Errors in En->Sk Translation
- Creator:
- Galuščáková, Petra and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, errors classification, and EN-SK translation
- Language:
- Slovak and English
- Description:
- Manual classification of errors of English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from WMT 2011 test set [2] were translated by 3 MT systems described in [3] and MT errors were manually marked and classified. Reference translation is included. References: [1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006. [2] http://www.statmt.org/wmt11/evaluation-task.html [3] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press. and This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
616. Manually Ranked Translation Outputs
- Creator:
- Bojar, Ondřej and Galuščáková, Petra
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, evaluation, and manual ranking
- Language:
- Slovak and Czech
- Description:
- Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on three data sets (100 sentences randomly selected from books, 100 sentences randomly selected from Acquis corpus and 50 first sentences from WMT 2010 test set). Ranking was applied in MT systems comparison in [1]. References: [1] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011 and This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
617. Many Czech References for 50 Sentences Selected from WMT11 Data
- Creator:
- Bojar, Ondřej, Macháček, Matouš, Tamchyna, Aleš, and Zeman, Daniel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, automatic machine translation evaluation, and reference translation
- Language:
- Czech
- Description:
- This dataset contains the whole set of very many Czech translations for 50 English source sentences coming from WMT11 test set (http://www.statmt.org/wmt11). In total, there are 15431447 Czech sentences, i.e. 300k reference translations per source English sentence on average, but the exact number greatly varies across sentences. You can find more details in included README file. If you use this dataset, please cite the following paper which describes the technique used to construct the Czech translations: Bojar Ondřej, Macháček Matouš, Tamchyna Aleš, Zeman Daniel: Scratching the Surface of Possible Translations. Lecture Notes in Computer Science, Vol. 8082, Text, Speech and Dialogue: 16th International Conference, TSD 2013. Proceedings, Copyright © Springer Verlag, Berlin / Heidelberg, ISBN 978-3-642-40584-6, ISSN 0302-9743, pp. 465-474, 2013, DOI: 10.1007/978-3-642-40585-3_59 and P406/11/1499 of the Grant Agency of the Czech Republic, FP7-ICT-2011-7-288487 (MosesCore) of the European Union and 1356213 of the Grant Agency of the Charles University
- Rights:
- Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB
618. Mapping Czech Verbal Valency to PropBank Argument Labels: LREC2024 - verification data
- Creator:
- Hajič, Jan, Fučíková, Eva, Lopatková, Markéta, and Urešová, Zdeňka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, computationalLexicon, and lexicalConceptualResource
- Subject:
- predicate argument structure, valency, syntax, semantics, semantic roles, PropBank, Prague Dependency Treebank, SynSemClass, and Unified Verb Index
- Language:
- English and Czech
- Description:
- Mapping table for the article Hajič et al., 2024: Mapping Czech Verbal Valency to PropBank Argument Labels, in LREC-COLING 2024, as preprocess by the algorithm described in the paper. This dataset i smeant for verification (replicatoin) purposes only. It will b manually processed further to arrive at a workable CzezchpropBank, to be used in Czech UMR annotation, to be further updated during the annotation. The resulting PropBank frame files fir Czech are expected to be available with some future releases of UMR, containing Czech UMR annotation, or separately.
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
619. Marquesan corpus
- Publisher:
- Max Planck Institute for Psycholinguistics
- Type:
- corpus
- Language:
- Macedonian
- Description:
- Documentation of the Marquesan language and culture project (DoBeS project)
- Rights:
- Code of conduct
620. MCSQ Translation Models (en-de) (v1.0)
- Creator:
- Variš, Dušan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- machine translation, transformer, and neural machine translation
- Language:
- English and German
- Description:
- En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). Their main use should be in-domain translation of social surveys. Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on MCSQ test set (BLEU): en->de: 67.5 (train: genuine in-domain MCSQ data only) de->en: 75.0 (train: additional in-domain backtranslated MCSQ data) (Evaluated using multeval: https://github.com/jhclark/multeval)
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB