An XML-based file containing the electronic version of the al wassit dictionary, an Arabic monolingual dictionary produced by the Academy of the Arabic Language in Cairo.
An LMF-conformant XML-based file containing the electronic version of the al wassit dictionary, an Arabic monolingual dictionary produced by the Academy of the Arabic Language in Cairo.
A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word-aligned sentence pairs from the English and Swedish versions of Europarl v. 2.
The data can be found here: https://www.ida.liu.se/labs/nlplab/ges/
This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
A morphological layer for the German part of the SMULTRON corpus. The layer was annotated according to the STTS tagset and the annotation guidelines of the Tiger corpus.
Coordinator: Thomas Müller
Annotators: Francesca Caratti, Arne Recknagel
This distribution contains a morphological layer for the SMULTRON corpus [0].
The annotation process is described in:
@InProceedings{mueller2015,
author = {M\"uller, Thomas and Sch\"utze, Hinrich},
title = {Robust Morphological Tagging with Word Representations},
booktitle = {Proceedings of NAACL},
year = {2015},
}
[0] http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks/smultron_en.html
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved. This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic).
This XML file describes the Arabic phonetic constraints to be applied to Arabic roots. The first rule category lists the letters that may not occur in the same root, regardless of their order. The second category lists the letters that may not be used together in a root in a specific order. The third and fourth categories specify that contiguous letters must not be redundant.
ISLRN: 991-445-325-823-5
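The rule categories described above could be checked against a candidate root roughly as follows. This is a minimal sketch, not the resource's own tooling: the function name `check_root`, the rule containers, and the placeholder letters "A" to "D" are all hypothetical, and the actual letter sets have to be read from the XML file. Two readings are assumptions: the ordered rule is taken to forbid the second letter anywhere after the first, and the third and fourth categories are taken to mean that adjacent letters must differ.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Placeholder rules for illustration only; the real letter sets come from the XML file.
UNORDERED_RULES: List[FrozenSet[str]] = [frozenset({"A", "B"})]
ORDERED_RULES: List[Tuple[str, str]] = [("C", "D")]


def check_root(root: Sequence[str],
               unordered: List[FrozenSet[str]] = UNORDERED_RULES,
               ordered: List[Tuple[str, str]] = ORDERED_RULES) -> List[str]:
    """Return the names of the rule categories that the root violates."""
    letters = list(root)
    violations = []

    # Category 1: letters that may not share a root, in any order.
    if any(rule <= set(letters) for rule in unordered):
        violations.append("unordered")

    # Category 2 (assumed reading): `second` may not appear anywhere after `first`.
    if any(first in letters and second in letters[letters.index(first) + 1:]
           for first, second in ordered):
        violations.append("ordered")

    # Categories 3 and 4 (assumed reading): adjacent letters must differ.
    if any(a == b for a, b in zip(letters, letters[1:])):
        violations.append("contiguous")

    return violations
```

An empty result means the root passes all placeholder rules; real use would replace the two rule lists with the constraints loaded from the file.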
Lexical network AdjDeriNet consists of pairs of base adjectives and their derivatives. It contains nearly 18 thousand base adjectives that are base words for more than 26 thousand lexemes of several parts of speech.
Phonological networks are representations of word forms and their phonological relationships with other words in a given language lexicon. A principle underlying the growth (or evolution) of those networks is preferential attachment, or the ‘rich-gets-richer’ mechanism, according to which words with many phonological neighbors (or links) are the main beneficiaries of future growth opportunities. Due to their limited number of words, language lexica constitute node-constrained networks where growth cannot keep increasing in a linear way; hence, preferential attachment is likely mitigated by certain factors. The present study investigated aging effects (i.e., a word’s finite time span of being active in terms of growth) in an evolving phonological network of English as a second language. It was found that phonological neighborhoods are constructed by one large initial lexical spurt, followed by sublinear growth spurts that eventually lead to very limited growth in later lexical spurts during network evolution, all the while obeying the law of preferential attachment. An analysis of the strength of phonological relationships between phonological word forms revealed a tendency to attach more distant phonological neighbors in the lower proficiency levels, while phonologically more similar neighbors enter phonological neighborhoods at more advanced levels of English as a second language. Overall, the findings suggest an aging effect in growth that favors younger words. In addition, beginning learners seem to prefer the acquisition of phonological neighbors that are easier to discriminate. Implications for the second language lexicon include leveraged learning mechanisms, learning bouts focussed on a smaller range of phonological segments, and questions concerning lexical processing in aging networks.
Corpus AKCES 2 consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Funding: MŠMT (MSM0021620825), UK (PRVOUK P 10).
Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Funding: UK (PRVOUK P10).
Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora). Funding: ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10).
Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora). Funding: ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10).
Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text.
Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text.
In addition to fixing a few minor bugs, this release fixes a critical issue in Release 1: native speakers of Ukrainian (s_L1:"uk") were wrongly labelled as speakers of "other European languages" (s_L1_group="IE") instead of speakers of a Slavic language (s_L1_group="S"). The file is now a regular XML document, with all annotation represented as XML attributes.
AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format.
Note that, in comparison to the CZESL-GEC dataset, this dataset contains separated edits together with their type annotations in M2 format and has twice as many sentences.
If you use this dataset, please use the following citation:
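For readers unfamiliar with M2, the following is a minimal sketch of how such files are commonly read: each sentence is an `S` line of space-separated tokens, followed by `A` lines of the form `start end|||type|||correction|||...|||annotator`. The names `parse_m2` and `apply_edits` are hypothetical helpers, and the sample sentence and the `PLACEHOLDER` error type are invented for illustration; they are not taken from AKCES-GEC.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Edit:
    start: int        # index of the first token to replace
    end: int          # index one past the last token to replace
    error_type: str
    correction: str
    annotator: int


@dataclass
class Sentence:
    tokens: List[str]
    edits: List[Edit] = field(default_factory=list)


def parse_m2(text: str) -> List[Sentence]:
    """Parse M2-formatted text into sentences with their edits."""
    sentences: List[Sentence] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("S "):
            sentences.append(Sentence(tokens=line[2:].split()))
        elif line.startswith("A ") and sentences:
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            sentences[-1].edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1])))
    return sentences


def apply_edits(sentence: Sentence, annotator: int = 0) -> List[str]:
    """Return the corrected tokens according to one annotator's edits."""
    tokens = list(sentence.tokens)
    edits = [e for e in sentence.edits
             if e.annotator == annotator and e.start >= 0]  # skip no-op edits
    # Apply from right to left so earlier token offsets stay valid.
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        tokens[e.start:e.end] = e.correction.split()
    return tokens


# Invented sample, not from the corpus.
SAMPLE = """S This are a test .
A 1 2|||PLACEHOLDER|||is|||REQUIRED|||-NONE-|||0
"""

sentences = parse_m2(SAMPLE)
corrected = apply_edits(sentences[0])
```

Applying edits from right to left is a design choice that avoids recomputing offsets after each replacement.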
@article{naplava2019wnut,
title={Grammatical Error Correction in Low-Resource Scenarios},
author={N{\'a}plava, Jakub and Straka, Milan},
journal={arXiv preprint arXiv:1910.00353},
year={2019}
}