Search Results
2. Addressed Arabic Phonetic Rules
- Creator:
- Mustafa, Ebtihal and Bouzoubaa, Karim
- Publisher:
- languages journal
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- phonetics and Arabic phonetic system
- Language:
- Arabic
- Description:
- This XML file describes the Arabic phonetic constraints to be applied to Arabic roots. The first rule category lists the letters that may not occur in the same root, regardless of their order. The second category lists the letters that may not be used together in a root in a specific order. The third and fourth categories show that contiguous letters must not be redundant. ISLRN: 991-445-325-823-5
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
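The rule categories described for this resource can be sketched as a small root checker. This is only an illustration under stated assumptions: the record does not show the XML schema, so the rules are held in plain Python sets, the letters are Latin placeholders rather than real Arabic rules, "a specific order" is read as "one letter anywhere before the other", and "redundant" is read as "two identical adjacent letters". The names `UNORDERED_FORBIDDEN`, `ORDERED_FORBIDDEN`, and `violations` are hypothetical.

```python
from itertools import combinations

# Placeholder rules (NOT taken from the resource); real rules would be
# loaded from the XML file, whose schema is not shown in this record.
UNORDERED_FORBIDDEN = {frozenset({"x", "y"})}  # category 1: never in the same root
ORDERED_FORBIDDEN = {("p", "q")}               # category 2: "p" may not precede "q"


def violations(root: str) -> list[str]:
    """Return human-readable rule violations for a root given as a string of letters."""
    problems = []
    # combinations() yields letter pairs in positional order (first before second).
    for first, second in combinations(root, 2):
        # Category 1: the pair is forbidden regardless of order.
        if frozenset({first, second}) in UNORDERED_FORBIDDEN:
            problems.append(f"{first!r} and {second!r} may not share a root")
        # Category 2: the pair is forbidden only in this order (assumed: anywhere earlier).
        if (first, second) in ORDERED_FORBIDDEN:
            problems.append(f"{first!r} may not precede {second!r}")
    # Categories 3 and 4, read here as: adjacent letters may not be identical.
    for left, right in zip(root, root[1:]):
        if left == right:
            problems.append(f"adjacent repetition of {left!r}")
    return problems
```

Calling `violations("qkp")` on these placeholder rules returns an empty list, while `violations("pkq")` reports the ordered constraint.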
3. AdjDeriNet: Words Derived from Adjectives in Czech
- Creator:
- Ševčíková, Magda and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- adjectives, derivation, word-formation, and derivational morphology
- Language:
- Czech
- Description:
- Lexical network AdjDeriNet consists of pairs of base adjectives and their derivatives. It contains nearly 18 thousand base adjectives that are base words for more than 26 thousand lexemes of several parts of speech.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
4. Broken plural list
- Creator:
- Ouamer, Meriem, Bouzoubaa, Karim, and Tajmout, Rachida
- Publisher:
- ALELM research group
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- Broken plural
- Language:
- Arabic
- Description:
- An LMF-conformant XML file containing a comprehensive list of Arabic broken plurals. The file contains 12,249 singular words with their corresponding broken plurals (BPs).
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
5. Czech Multiword Expressions
- Creator:
- Nevěřilová, Zuzana
- Publisher:
- Faculty of Informatics, Masaryk University
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- multiword expressions
- Language:
- Czech
- Description:
- The dataset contains 4731 frozen continuous Czech multiword expressions. Inflectional word forms are generated for those MWEs where applicable. In total, the dataset contains 24,807 MWE forms.
- Rights:
- Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB
6. Czech SubLex 1.0
- Creator:
- Veselovská, Kateřina and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicalConceptualResource, and wordList
- Subject:
- subjectivity lexicon, sentiment analysis, opinion mining, and polarity clues
- Language:
- Czech
- Description:
- Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part-of-speech tags, polarity orientation and source information. The core of the Czech subjectivity lexicon was obtained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used the parallel corpus CzEng 1.0, containing 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep layers of syntactic representation. Afterwards, the lexicon was manually refined by an experienced annotator. The work on this project has been supported by the GAUK 3537/2011 grant and by SVV project number 267 314.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
7. FASpell
- Creator:
- QasemiZadeh, Behrang
- Publisher:
- Behrang-QasemiZadeh
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- spellchecking, spellchecker, and Evaluation Dataset for Automatic Spell Checking
- Language:
- Persian
- Description:
- The FASpell dataset was developed for the evaluation of spell-checking algorithms. It contains pairs of misspelled Persian words and their corrected forms, similar to the ASpell dataset used for English. The dataset consists of two parts: a) faspell_main: a list of 5050 pairs collected from errors made by elementary school pupils and professional typists; b) faspell_ocr: a list of 800 pairs collected from the output of a Farsi OCR system.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
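A dataset of (misspelled, corrected) pairs like this one is typically used to score a spell checker by how often its suggestion matches the corrected form. The sketch below shows that idea only; the record does not describe the file layout, so the pairs are passed in memory, and the function name `accuracy` and the `correct` callback are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def accuracy(pairs: Iterable[Tuple[str, str]], correct: Callable[[str], str]) -> float:
    """Fraction of pairs for which the checker's output equals the corrected form."""
    pairs = list(pairs)
    if not pairs:
        return 0.0  # avoid division by zero on an empty evaluation set
    hits = sum(1 for misspelled, expected in pairs if correct(misspelled) == expected)
    return hits / len(pairs)
```

Any spell checker can be plugged in as `correct`, as long as it maps one misspelled word to one suggested form; scoring faspell_main and faspell_ocr separately would keep typing errors and OCR errors apart.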
8. WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, based on WordSim353
- Creator:
- Cinková, Silvie, Straková, Jana, Hajič, Jakub, Hajič, Jan, Hajič, Jan, jr., Janoušková, Jolana, Straka, Milan, and Urešová, Miroslava
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- lexical semantics, similarity, relatedness, evaluation, and distributional semantics
- Language:
- Czech and English
- Description:
- Czech translation of WordSim353. The Czech translations of the English WordSim353 word pairs were obtained from four translators. All translation variants were scored by 25 Czech annotators according to the lexical similarity/relatedness annotation instructions for WordSim353 annotators. The resulting data set consists of two annotation files: "WordSim353-cs.csv" and "WordSim-cs-Multi.csv". Both files are encoded in UTF-8, have a header, text is enclosed in double quotes, and columns are separated by commas. The rows are numbered. The WordSim-cs-Multi data set has rows numbered from 1 to 634, whereas the row indices in the WordSim353-cs data set reflect the corresponding row numbers in the WordSim-cs-Multi data set. The WordSim353-cs file contains a one-to-one mapping selection of 353 Czech equivalent pairs whose judgments proved to be most similar to the judgments of their corresponding English originals (compared by the absolute value of the difference between the means over all annotators in each language counterpart). In one case ("psychology-cognition"), two Czech equivalent pairs had identical means as well as confidence intervals, so we randomly selected one. The "WordSim-cs-Multi.csv" file contains human judgments for all translation variants. In both data sets, we preserved all 25 individual scores. In the WordSim353-cs data set, we added a column with the Czech means and a column with the original English means, along with 95% confidence intervals in separate columns for each mean (computed by the CI function in the Rmisc R package). The WordSim-cs-Multi data set contains only the Czech means and confidence intervals. For convenient lexical search, we provided separate columns with the respective Czech and English single words, entire word pairs, and finally an English-Czech quadruple in both data sets.
The data set also contains an XLS table with the four translations and a preliminary selection of the best variants performed by an adjudicator.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
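The file format and the selection criterion described for this data set (UTF-8, header row, double-quoted text, comma-separated; one Czech variant per English pair, chosen by the smallest absolute difference between the Czech and English means) can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the column names `row`, `en_pair`, `cs_pair`, `cs_mean`, and `en_mean` are hypothetical, the sample rows are invented placeholders, and ties here go to the first row rather than to a random choice.

```python
import csv
import io

# Invented placeholder rows in the described layout (quoted text, commas, header).
SAMPLE = '''"row","en_pair","cs_pair","cs_mean","en_mean"
"1","en-pair-A","cs-variant-1","7.0","7.4"
"2","en-pair-A","cs-variant-2","6.0","7.4"
"3","en-pair-B","cs-variant-3","2.5","2.0"
'''


def read_rows(text: str) -> list[dict]:
    """Parse comma-separated, double-quoted text with a header row."""
    # For a real file: open(path, encoding="utf-8", newline="") instead of StringIO.
    return list(csv.DictReader(io.StringIO(text)))


def select_closest(rows: list[dict]) -> dict[str, str]:
    """For each English pair, keep the Czech variant whose mean is closest to the English mean."""
    best: dict[str, tuple[float, str]] = {}
    for row in rows:
        gap = abs(float(row["cs_mean"]) - float(row["en_mean"]))
        key = row["en_pair"]
        # Strict "<" keeps the first row on ties (an assumption of this sketch).
        if key not in best or gap < best[key][0]:
            best[key] = (gap, row["cs_pair"])
    return {en_pair: cs_pair for en_pair, (gap, cs_pair) in best.items()}
```

On the placeholder rows, `select_closest(read_rows(SAMPLE))` keeps `cs-variant-1` for `en-pair-A` and `cs-variant-3` for `en-pair-B`.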