Harvested from: LINDAT/CLARIAH-CZ repository / Language: English - LINDAT/CLARIAH-CZ Catalog Search Results

Start Over Language English Harvested from LINDAT/CLARIAH-CZ repository

361. WMT18 Quality Estimation Shared Task Training and Development Data

Creator:: Specia, Lucia, Logacheva, Varvara, Blain, Frederic, Fernandez, Ramon, and Martins, André
Publisher:: University of Sheffield
Type:: text and corpus
Subject:: machine translation, quality estimation, and machine learning
Language:: English, German, Czech, and Latvian
Description:: Training and development data for the WMT18 QE task. Test data will be published as a separate item. This shared task will build on its previous six editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks make use of datasets produced from post-editions by professional translators. The datasets are domain-specific (IT and life sciences/pharma domains) and extend from those used previous years with more instances and more languages. One important addition is that this year we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are: To study the performance of quality estimation approaches on the output of neural MT systems. We will do so by providing datasets for two language language pairs where the same source segments are translated by both a statistical phrase-based and a neural MT system. To study the predictability of deleted words, i.e. words that are missing in the MT output. TO do so, for the first time we provide data annotated for such errors at training time. To study the effectiveness of explicitly assigned labels for phrases. We will do so by providing a dataset where each phrase in the output of a phrase-based statistical MT system was annotated by human translators. To study the effect of different language pairs. We will do so by providing datasets created in similar ways for four language language pairs. To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits. Measure progress over years at all prediction levels. We will do so by using last year's test set for comparative experiments. In-house statistical and neural MT systems were built to produce translations for all tasks. MT system-dependent information can be made available under request. The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.
Rights:: AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

362. Word Importance Dataset

Creator:: Osuský, Adam and Javorský, Dávid
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: word importance, ranking, and importance ranking
Language:: English
Description:: This dataset comprises a corpus of 50 text contexts, each about 60 words in length, sourced from five distinct domains. Each context has been evaluated by multiple annotators who identified and ranked the most important words—up to 10% of each text—according to their perceived significance. The annotators followed specific guidelines to ensure consistency in word selection and ranking. For further details, please refer to the cited source. --- rankings_task.csv - This csv contains information about the contexts which are to be annotated: - id: A unique identifier for each task. - content: The context to be ranked. --- rankings_ranking.csv - This csv includes ranking information for various assignments. It contains four columns: - id: A unique identifier for each ranking entry. - score: The score assigned to the entry. - word_order: A JSON detailing the order of words positions. It is essentially the selected word positions and their ordering from an annotator. - assignment_id: A reference ID linking to the assignments. --- rankings_assignment.csv - This csv tracks the completion status of tasks by users. It includes four columns: - id: A unique identifier for each assignment entry. - is_completed: A binary indicator (1 for completed, 0 for not completed). - task_id: A reference ID linking to the tasks. - user_id: The identifier for the user who should complete the task (rank the words). --- Known Issues: Please note that each annotator was intended to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be cautious of this issue and verify the uniqueness of the annotations where necessary. --- This dataset is a part of work from a bachelor thesis: OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

363. Word representations for multiple languages

Creator:: Müller, Thomas and Schütze, Hinrich
Publisher:: Center for Information and Language Processing, University of Munich
Type:: text and corpus
Subject:: morphological dictionary, morphological analysis, and PoS tagging
Language:: English, German, Latin, Hungarian, Spanish, and Czech
Description:: Dictionaries with different representations for various languages. Representations include brown clusters of different sizes and morphological dictionaries extracted using different morphological analyzers. All representations cover the most frequent 250,000 word types on the Wikipedia version of the respective language. Analzers used: MAGYARLANC (Hungarian, Zsibrita et al. (2013)), FREELING (English and Spanish, Padro and Stanilovsky (2012)), SMOR (German, Schmid et al. (2004)), an MA from Charles University (Czech, Hajic (2001)) and LATMOR (Latin, Springmann et al. (2014)).
Rights:: Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB

364. Word-final /s/ durations in spoken German

Creator:: Luef, Eva Maria
Publisher:: Charles University and Universität Hamburg
Type:: text and corpus
Subject:: sibilant, acoustics, word-final s, duration, and German
Language:: English
Description:: German has various homophonous sibilant fricatives of phonemic or morphemic nature that can appear in word-final position. In English, the functional status of a word-final \s\ influences its durational properties, with phonemic \s\ being longer than morphemic types. The data set presented here is a small selection of laboratory-elicited German sentences containing various words with final sibilant phonemes (e.g., "das Haus") and morphemes (plural, genitive, clitic, inflection). Durations of the \s\ types were measured and compared across the conditions. An ANOVA between the \s\ types and post-hoc Tukey pair-wise comparisons are presented that show various significant differences. The submission consists of a csv data file, containing a number of variables, and a PDF document detailing the experiment and variables.
Rights:: Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0), http://creativecommons.org/licenses/by-nd/4.0/, and PUB

365. WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, based on WordSim353

Creator:: Cinková, Silvie, Straková, Jana, Hajič, Jakub, Hajič, Jan, Hajič, Jan, jr., Janoušková, Jolana, Straka, Milan, and Urešová, Miroslava
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, wordList, and lexicalConceptualResource
Subject:: lexical semantics, similarity, relatedness, evaluation, and distributional semantics
Language:: Czech and English
Description:: Czech translation of WordSim353. The Czech translation of English WordSim353 word pairs were obtained from four translators. All translation variants were scored according to the lexical similarity/relatedness annotation instructions for WordSim353 annotators, by 25 Czech annotators. The resulting data set consists of two annotation files: "WordSim353-cs.csv" and "WordSim-cs-Multi.csv". Both files are encoded in UTF-8, have a header, text is enclosed in double quotes, and columns are separated by commas. The rows are numbered. The WordSim-cs-Multi data set has rows numbered from 1 to 634, whereas the row indices in the WordSim353-cs data set reflect the corresponding row numbers in the WordSim-cs-Multi data set. The WordSim353-cs file contains a one-to-one mapping selection of 353 Czech equivalent pairs whose judgments have proven to be most similar to the judgments of their corresponding English originals (compared by the absolute value of the difference between the means over all annotators in each language counterpart). In one case ("psychology-cognition"), two Czech equivalent pairs had identical means as well as confidence intervals, so we randomly selected one. The "WordSim-cs-Multi.csv" file contains human judgments for all translation variants. In both data sets, we preserved all 25 individual scores. In the WordSim353-cs data set, we added a column with their Czech means as well as a column containing the original English means and 95% confidence intervals in separate columns for each mean (computed by the CI function in the Rmisc R package). The WordSim-cs-Multi data set contains only the Czech means and confidence intervals. For the most convenient lexical search, we provided separate columns with the respective Czech and English single words, entire word pairs, and eventually an English-Czech quadruple in both data sets. The data set also contains an xls table with the four translations and a preliminary selection of the best variants performed by an adjudicator.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

366. Wortschatz

Publisher:: University of Leipzig
Type:: corpus
Language:: Afrikaans, Albanian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, German, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Malay (macrolanguage), Norwegian, Occitan (post 1500), Romanian, Russian, Slovak, Slovenian, Spanish, Sundanese, Swedish, Tagalog, Turkish, Vietnamese, and Welsh
Description:: Collected from newspaper texts, webcrawling, etc.: words (+frequency), cooccurrences (+graph), left/right neighbours, example sentences
Rights:: Not specified

367. YAWA - Yet Another Word Aligner

Publisher:: Research Institute for Artificial Intelligence, Romanian Academy of Sciences
Type:: toolService
Subject:: word aligner
Language:: English and Romanian
Description:: YAWA is a four stage lexical aligner that uses bilingual translation lexicons produced by [[http://www.clarin.eu/tools/translation-equivalents-extractor|TREQ]] and phrase boundaries detection to align words of a given bitext. Using this alignment, in stage 2 a language dependent module takes over and produces alignments of the remaining lexical tokens within aligned chunks. Stage 3 is specialized in aligning blocks of consecutive unaligned tokens and stage 4 deletes alignments that are likely to be wrong. Developed in PERL, YAWA is language independent, except for the modules that realise alignments specific to the pairs of aligned languages. So far, it works just for Ro-En pair of languages. It requires a parallel corpus in [[http://www.xces.org|XCES]] format, morpho-syntactically annotated and lemmatized (using [[http://www.clarin.eu/tools/ttl-tokenizing-tagging-and-lemmatizing-free-running-texts|TTL]]), and translation dictionaries produced by [[http://www.clarin.eu/tools/translation-equivalents-extractor|TREQ]]. YAWA’s individual F-measure is 81.22%. Currently YAWA is a part of the [[http://www.clarin.eu/tools/cowal-combined-word-aligner|COWAL]] combined lexical alignment platform. More detailed descriptions are available in [[http://www.racai.ro/~tufis/papers|the following papers]]: -- Radu Ion (2007). Word Sense Disambiguation Methods Applied to English and Romanian. (in Romanian). PhD thesis. Romanian Academy, Bucharest -- Dan Tufiş (2007). Exploiting Aligned Parallel Corpora in Multilingual Studies and Applications. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Intercultural Collaboration. First International Workshop (IWIC 2007), volume 4568 of Lecture Notes in Computer Science, pp. 103-117. Springer-Verlag, August 2007. ISBN 978-3-540-73999-9. -- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2006). Improved Lexical Alignment by Combining Multiple Reified Alignments. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Proceedings of the 11th Conference EACL2006, pp. 153-160, Trento, Italy, April 2006. Association for Computational Linguistics. ISBN 1-9324-32-61-2.
Rights:: Not specified

368. York-Helsinki Parsed Corpus of Old English Poetry

Publisher:: University of York
Type:: corpus
Language:: English
Description:: A selection of poetic texts (71,490 words) from the Old English Section of the Helsinki Corpus of English Texts, syntactically and morphologically annotated.
Rights:: Not specified

369. York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE)

Publisher:: University of York
Type:: corpus
Language:: English
Description:: 1.5 million word syntactically-annotated corpus of Old English prose texts
Rights:: Not specified

« Previous
Next »
1
2
…
33
34
35
36
37

361. WMT18 Quality Estimation Shared Task Training and Development Data

362. Word Importance Dataset

363. Word representations for multiple languages

364. Word-final /s/ durations in spoken German

365. WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, based on WordSim353

366. Wortschatz

367. YAWA - Yet Another Word Aligner

368. York-Helsinki Parsed Corpus of Old English Poetry

369. York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE)

Limit your search

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Coverage

Show values starting with

Creator

Show values starting with

Format

Language

Show values starting with

Publisher

Show values starting with

Rights

Show values starting with

Subject

Show values starting with

Type

Show values starting with

Date

Original context has metadata only

Harvested from