Harvested from: LINDAT/CLARIAH-CZ repository / Type: lexicalConceptualResource

1. "Al wassit" Arabic dictionary

Creator:: Namly, Driss
Publisher:: Ibtikarat team
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexical semantics
Language:: Arabic
Description:: An XML-based file containing the electronic version of al wassit dictionary. An Arabic monolingual dictionary accomplished by the Academy of the Arabic Language in Cairo
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

2. "Al wassit" LMF Arabic dictionary

Creator:: Namly, Driss
Publisher:: Ibtikarat team
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexical semantics
Language:: Arabic
Description:: An LMF conformant XML-based file containing the electronic version of al wassit dictionary. An Arabic monolingual dictionary accomplished by the Academy of the Arabic Language in Cairo
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

3. A Gold Standard Word Alignment for English-Swedish

Creator:: Ahrenberg, Lars and Holmqvist, Maria
Publisher:: Linköping University
Type:: text, wordList, and lexicalConceptualResource
Subject:: word alignment
Language:: Swedish and English
Description:: A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word aligned sentences pairs from English and Swedish versions of Europarl v. 2. The data can be found here: https://www.ida.liu.se/labs/nlplab/ges/
Rights:: Not specified

4. A Gold Standard Word Alignment for English-Swedish (2015-10-12)

Creator:: Ahrenberg, Lars and Holmqvist, Maria
Publisher:: Linköping University
Type:: text, wordList, and lexicalConceptualResource
Subject:: word alignment
Language:: Swedish and English
Description:: A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word aligned sentences pairs from English and Swedish versions of Europarl v. 2.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

5. Addressed Arabic Phonetic Rules

Creator:: Mustafa, Ebtihal and Bouzoubaa, Karim
Publisher:: languages journal
Type:: text, wordList, and lexicalConceptualResource
Subject:: phonetics and Arabic phonetic System.
Language:: Arabic
Description:: This xml file describes the Arabic phonetic constraints are to be applied on Arabic root. The first rule category lists the letters that may not occur in the same root, regardless of their order. The second category lists the letters that may not be used together in a root word with a specific order. The third and fourth categories show that each contiguous letters must not be redundant ISLRN: 991-445-325-823-5
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

6. AdjDeriNet: Words Derived from Adjectives in Czech

Creator:: Ševčíková, Magda and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, wordList, and lexicalConceptualResource
Subject:: adjectives, derivation, word-formation, and derivational morphology
Language:: Czech
Description:: Lexical network AdjDeriNet consists of pairs of base adjectives and their derivatives. It contains nearly 18 thousand base adjectives that are base words for more than 26 thousand lexemes of several parts of speech.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

7. Álgu – Origins of Saami Words

Publisher:: The Research Institute for the Languages of Finland
Type:: lexicalConceptualResource
Language:: Northern Sami
Description:: The database will contain an etymological lexicon of Saami languages complete with detailed source citations. The database will be open to the public in November 2006 and will be updated regularly.
Rights:: Not specified

8. Álgu – Origins of Saami Words (Álgu – Saamen sanojen etymologinen tietokanta)

Type:: lexicalConceptualResource
Description:: 70,000 words, over 100,000 etymological relations, Relational database
Rights:: Not specified

9. Aquén - Toponimia galega

Publisher:: TALG Research Group (University of Vigo)
Type:: lexicalConceptualResource
Language:: Galician
Description:: Galician Toponymy Database, 40,000 entries
Rights:: Not specified

10. Arabic characters lexicon

Creator:: Namly, Driss
Publisher:: Ibtikarat team
Type:: text, lexicon, and lexicalConceptualResource
Subject:: alphabets
Language:: Arabic
Description:: A XML-based file containing all Arabic characters (letters, vowels and punctuations). Each character described with a description, different displays (isolated, at the beginning, middle and the end of a word), a codification (Unicode, others could be added later), and two transliterations (Buckwalter and wiki)
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

11. Arabic Enclitics Lexicon

Creator:: Loukili, Taoufik
Publisher:: Ibtikarat team
Type:: text, lexicon, and lexicalConceptualResource
Subject:: Enclitics
Language:: Arabic
Description:: An XML-based file containing all Arabic enclitics
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

12. Arabic Morphological evaluation corpus

Creator:: Jaafar, Younes
Publisher:: Ibtikarat team
Type:: text, wordList, and lexicalConceptualResource
Subject:: morphological analysis and benchmarking corpus
Language:: Arabic
Description:: An annotated corpus dedicated to the benchmark and evaluation of Arabic morphological analyzers. It consists of 100 words with all their possible analysis. The corpus contains several morphological information such as stem, pattern, root, lemma, etc.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

13. Arabic Particles Lexicon

Creator:: Namly, Driss
Publisher:: Ibtikarat team
Type:: text, lexicon, and lexicalConceptualResource
Subject:: particles
Language:: Arabic
Description:: An XML-based file containing Arabic particles
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

14. Arabic Phonetic Rules

Creator:: Mustafa, Ebtihal and Bouzoubaa, Karim
Publisher:: languages journal
Type:: text, other, and lexicalConceptualResource
Subject:: Arabic and phonetic rules
Language:: Arabic
Description:: Description: this xml file describes the Arabic phonetic constraints (rules) resulting from the analysis of the lexicons(Taj Alarous, Al ain, Lisan Al arab, Alwassit and almoassir ). These rules are to be applied to Arabic roots and are classified into a number of categories. Each category has a certain type of constraints as follow: The first category defines that the root must not consist of three identical letters. The second category defines that the root must not start with two repeating letters. The third category lists the letters that must not occur in the same root, regardless of their order. The fourth category lists the letters that may not be used together in a certain order in a root. ISLRN: 190-535-098-473-3
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

15. Arabic Proclitics Lexicon

Creator:: Loukili, Taoufik
Publisher:: Ibtikarat team
Type:: text, lexicon, and lexicalConceptualResource
Subject:: proclitics
Language:: Arabic
Description:: An XML-based file containing all Arabic proclitics
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

16. Arabic Special verbs Lexicon

Creator:: Namly, Driss
Publisher:: Ibtikarat team
Type:: text, lexicon, and lexicalConceptualResource
Subject:: particles
Language:: Arabic
Description:: An XML-based file containing Arabic Stop-words respecting verbs syntax
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

17. Arabic Triliteral roots Lexicon

Creator:: Mustafa, Ebtihal and Bouzoubaa, Karim
Publisher:: MDPI language jurnal
Type:: text, lexicon, and lexicalConceptualResource
Subject:: Arabic language, Arabic roots, lexicons, phonetic system, bigram frequencies, and roots weight.
Language:: Arabic
Description:: Description: This xml file is a lexicon containing all 21952 (28x28x28) Arabic triliteral combinations (roots). the file is split into three parts as follow: the first part contains the phonetic constraints that must be taken into account in the formation of Arabic roots (for more details see all_phonetic_rules.xml in http://arabic.emi.ac.ma/alelm/?q=Resources). the second part contains the lexicons that were used to create this lexicon (see in lexicons tag). the third part contains the roots. ISLRN: 813-907-570-946-2
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

18. Arabic WordNet ontology

Creator:: Abouenour, Lahcen, Bouzoubaa, Karim, and Rosso, Paulo
Publisher:: Wordnet
Type:: text, wordnet, and lexicalConceptualResource
Subject:: WordNet
Language:: Arabic
Description:: This improved version is an extension of the original Arabic Wordnet (http://globalwordnet.org/arabic-wordnet/awn-browser/), it was enriched by new verbs, nouns including the broken plurals that is a specific form for Arabic words.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

19. ATCC: Pronunciation lexicon and n-gram counts for ASR module

Creator:: Šmídl, Luboš
Publisher:: University of West Bohemia, Department of Cybernetics
Type:: text, lexicalConceptualResource, and other
Subject:: pronunciation lexicon, n-gram counts, and language model
Language:: English
Description:: The corpus contains pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing the language model for air traffic control communication domain. It could be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0). and Technology Agency of the Czech Republic, project No. TA01030476
Rights:: Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB

20. Banco de neologismos 2004-2007

Publisher:: Instituto Cervantes and Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: lexicalConceptualResource
Subject:: neologisms database
Language:: Catalan
Description:: Repository of neologisms (15.375 entries)
Rights:: Not specified

21. Base de synonymes CRISCO

Type:: lexicalConceptualResource
Language:: French
Description:: 49.000, RDB
Rights:: Not specified

22. Basic vocabulary on the Human Genome

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: lexicalConceptualResource
Language:: Catalan, English, French, Galician, Italian, Portuguese, and Spanish
Description:: A vocabulary resulting from the cooperation of the groups of REALITER network that collects the basic terminology mostly used in texts about Genomics. It contains equivalents in English, Peninsular and Latinamerican Spanish, French, Italian, Galician, Portuguese and Catalan.
Rights:: Not specified

23. Bavaria's Dialects Online

Creator:: Raaf, Manuel
Publisher:: Bayerische Akademie der Wissenschaften and Bavarian Academy of Sciences and Humanities
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: dictionary, web dictionary, Dialektologie, dialect variation, language variation, Dialectology, dialectology, Bavarian, Bavaria, Swabian, Frankish, Franconian Language, and spoken language
Language:: German, Bavarian, Swabian, and Frankish
Description:: Bavaria's Dialects Online (BDO) is the digital language information system of the three projects "Bavarian Dictionary", "Franconian Dictionary", and "Dialectological Information System of Bavarian Swabia". The database combines the research results of dialect research and presents dictionary articles as well as research data in a freely accessible online tool. BDO is not only aimed at scholars, but also at the lay public interested in the language. Here, the vocabulary of all Bavarian dialects is collected in one place and made accessible. The system shows the richness of the dialects of Bavaria in combination. With the new database, one will be able to compare the dialect vocabulary of Old Bavaria, Franconia and Swabia. Authentic dialect evidence is used to illustrate the dialect words in their variety of meanings and regional distribution, as well as to show their use in idioms, proverbs, and much more. BDO allows a whole new look at the vocabulary of the dialects of all parts of the state of Bavaria.
Rights:: Not specified

24. Bilder-Conversations-Lexikon

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: digitale Ausgabe der ersten Auflage des "Bilder-Conversations-Lexikons für das deutsche Volk" (1837-1841); "Handbuch zur Verbreitung gemeinnütziger Kenntnisse und zur Unterhaltung" (Selbstbeschreibung im Vorwort); beinhaltet zahlreiche Abbildungen und Landkarten
Rights:: Not specified

25. Bokmålsordboka

Publisher:: Department of Linguistics and Nordic Studies, University of Oslo
Type:: lexicalConceptualResource
Description:: 65 000 entries with definitions, etymology, examples
Rights:: Not specified

26. Bosworth-Toller’s Anglo-Saxon Dictionary online

Creator:: Tichý, Ondřej, Roček, Martin, Bočková, Renata, Čermák, Matěj, Dragounová, Jolana, Filipová, Helena, Gilová, Lucie, Hejná, Michaela, Hladíková, Lenka, Hladká, Alena, Hubinová, Veronika, Krajcsovicsová, Vlaďena, Kupková, Tatiana, Lebedeva, Tatiana, Malečková, Nikola, Novotná, Alena, Pazderová, Tereza, Popelíková, Jiřina, Rumlová, Jana, Tyčová Ocelík, Dana, Volná, Veronika, and Zahradníková, Tereza
Publisher:: Charles University, Faculty of Arts, Department of English Language and ELT Methodology
Type:: text, lexicon, and lexicalConceptualResource
Subject:: English, Old English, Anglo-Saxon, dictionary, Bosworth, Toller, lexicography, digitalization, English history, Mediaeval, and Medieval
Language:: English, Old English (ca. 450-1100), Latin, Ancient Greek (to 1453), and Ancient Hebrew
Description:: Description : This is an online edition of An Anglo-Saxon Dictionary, or a dictionary of "Old English". The dictionary records the state of the English language as it was used between ca. 700-1100 AD by the Anglo-Saxon inhabitants of the British Isles. This project is based on a digital edition of An Anglo-Saxon dictionary, based on the manuscript collections of the late Joseph Bosworth (the so called Main Volume, first edition 1898) and its Supplement (first edition 1921), edited by Joseph Bosworth and T. Northcote Toller, today the largest complete dictionary of Old English (one day to be hopefully supplanted by the DOE). Alistair Campbell's "enlarged addenda and corrigenda" from 1972 are not public domain and are therefore not part of the online dictionary. Please see the front & back matter of the paper dictionary for further information, prefaces and lists of references & contractions. The digitization project was initiated by Sean Crist in 2001 as a part of his Germanic Lexicon Project and many individuals and institutions have contributed to this project. Check out the original GLP webpage and the old Bosworth-Toller offline application webpage (to be updated). Currently the project is hosted by the Faculty of Arts, Charles University. In 2010, the data from the GLP were converted to create the current site. Care was taken to preserve the typography of the original dictionary, but also provide a modern, user friendly interface for contemporary users. In 2013, the entries were structurally re-tagged and the original typography was abandoned, though the immediate access to the scans of the paper dictionary was preserved. Our aim is to reach beyond a simple digital edition and create an online environment dedicated to all interested in Old English and Anglo-Saxon culture. Feel free to join in the editing of the Dictionary, commenting on its numerous entries or participating in the discussions at our forums. We hope that by drawing the attention of the community of Anglo-Saxonists to our site and joining our resources, we may create a more useful tool for everybody. The most immediate project to draw on the corrected and tagged data of the Dictionary is a Morphological Analyzer of Old English (currently under development). We are grateful for the generous support of the Charles University Grant Agency and for the free hosting at the Faculty of Arts at Charles University. The site is currently maintained and developed by Ondrej Tichy et al. at the Department of English Language and ELT Methodology, Faculty of Arts, Charles University in Prague (Czech Republic).
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

27. Brockhaus' Kleines Konversations-Lexikon

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: 5. Aufl. 1911; Fokus auf Politik, Wirtschaft, Kultur und Technik zu Beginn des 20. Jahrhunderts
Rights:: Not specified

28. Broken plural list

Creator:: Ouamer, meriem, Bouzoubaa, Karim, and Tajmout, rachida
Publisher:: ALELM research group
Type:: text, wordList, and lexicalConceptualResource
Subject:: Broken plural
Language:: Arabic
Description:: An LMF conformant XML-based file containing a comprehensive Arabic broken plural list. The file contains 12,249 singular words with their corresponding BPs
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

29. BulTreeBank Frequency List

Type:: lexicalConceptualResource
Language:: Bulgarian
Description:: 100 000 most frequent Cyrillic tokens in the BulTreeBank text archive, UTF-16 list of token-frequency pairs
Rights:: Not specified

30. BulTreeBank Stopword List

Type:: lexicalConceptualResource
Language:: Bulgarian
Description:: 805 prepositions, pronouns, etc stop words, UTF-16 list of wordforms
Rights:: Not specified

31. CALEM (Comprehensive Arabic LEMmas)

Creator:: Namly, Driss, Bouzoubaa, Karim, and El Jihad, Abdelhamid
Publisher:: ALELM
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon, lemmatization, and stemming;
Language:: Arabic
Description:: Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems) with details (POS + Root). Each lexical entry represents a lemma followed by all its possible stems and each stem is enriched by its morphological features especially the root and the POS. It is composed of 164,845 lemmas representing 7,200,918 stems, detailed as follow: 757 Arabic particles 2,464,631 verbal stems 4,735,587 nominal stems The lexicon is provided as an LMF conformant XML-based file in UTF8 encoding, which represents about 1,22 Gb of data. Citation: – Namly Driss, Karim Bouzoubaa, Abdelhamid El Jihad, and Si Lhoussain Aouragh. “Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique.” In Recent Advances in NLP: The Case of Arabic Language, pp. 81-100. Springer, Cham, 2020.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

32. canoonet – Deutsche Wörterbücher und Grammatik

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: Angabe von orthographischen, morphologischen (Wortformenbildung und Wortbildung) sowie semantischen Informationen (Synonymie; Hyperonymie/Hyponymie); Zuordnung der Wörter zu der jeweiligen syntaktischen Kategorie (bei Substantiven zusätzlich Angabe des Genus)
Rights:: Not specified

33. CELEX (web version)

Publisher:: Max Planck Institute for Psycholinguistics
Type:: lexicalConceptualResource
Language:: Dutch, English, and German
Rights:: Not specified

34. CLEF-TREC Q/A

Creator:: Abouenour, lahcen, Bouzoubaa, Karim, and Rosso, paolo
Publisher:: ALELM
Type:: text, other, and lexicalConceptualResource
Subject:: CLEF and TREC
Language:: Arabic
Description:: List of 2264 questions + answers of CLEF and TREC, translated to Arabic
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

35. Cologne Digital Sanskrit Dictionaries

Publisher:: Institute of Indology and Tamil Studies, Cologne University
Type:: lexicalConceptualResource
Language:: Sanskrit
Description:: Sanskrit lexicons. The data is made available as scanned images of the works as well as a digitization of the scanned images, which permits computer-aided analyses and displays of the work. Can be downloaded or queried online.
Rights:: Not specified

37. ConsILR - Consortium for the Romanian Language: Resources & Tools

Type:: lexicalConceptualResource
Language:: English and Romanian
Description:: Resources and tools developed for Romanian
Rights:: Not specified

38. Contemporary Arabic dictionary

Creator:: Namly, Driss
Publisher:: Ibtikarat Team
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexical semantics
Language:: Arabic
Description:: An XML-based file containing the electronic version of al logha al arabia al moassira (Contemporary Arabic) dictionary. An Arabic monolingual dictionary accomplished by Ahmed Mukhtar Abdul Hamid Omar (deceased: 1424) with the help of a working group
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

39. Conversations-Lexikon oder kurzgefaßtes Handwörterbuch

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: 1. Aufl. 1809-1811; Darstellung der Gegenstandsbereiche gesellschaftlicher Konversation; Berücksichtigung bedeutender historischer Ereignisse
Rights:: Not specified

40. Covid-19 Thesaurus

Creator:: Fener, Patricia
Publisher:: Institute for scientific and technical information (Inist) - CNRS/UAR76
Type:: thesaurus, text, and lexicalConceptualResource
Subject:: COVID-19, SARS coronavirus, Middle-East coronavirus, SARS-CoV, and MERS-CoV
Language:: French and English
Description:: This bilingual thesaurus (French-English), developed at Inist-CNRS, covers the concepts from the emerging COVID-19 outbreak which reminds the past SARS coronavirus outbreak and Middle East coronavirus outbreak. This thesaurus is based on the vocabulary used in scientific publications for SARS-CoV-2 and other coronaviruses, like SARS-CoV and MERS-CoV. It provides a support to explore the coronavirus infectious diseases. The thesaurus can be browsed and queried by humans and machines on the Loterre portal (https://www.loterre.fr), via an API and an rdf triplestore. It is also downloadable in PDF, SKOS, csv and json-ld formats. The thesaurus is made available under a CC-by 4.0 license.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), PUB, and http://creativecommons.org/licenses/by/4.0/

42. Croatian Morphological Lexicon

Publisher:: University of Zagreb, Faculty of Humanities and Social Sciences
Type:: lexicalConceptualResource
Language:: Croatian
Description:: 110,000+ lemmas; 3,900,000+ word-forms, MulText East lexica format
Rights:: Not specified

43. Czech Lexico-Semantic Database 0.1

Creator:: Tichy, Ondrej, Obstova, Zora, and Klegr, Ales
Publisher:: Charles University, Faculty of Arts
Type:: text, thesaurus, and lexicalConceptualResource
Subject:: onomasiological lexicography, thesaurus, lexico-semantic database, digitization, and Czech
Language:: Czech
Description:: A lexicographical project, whose aim is to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) in order to create an integrated digital multi-purpose lexico-semantic database of Czech.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

44. Czech Multiword Expressions

Creator:: Nevěřilová, Zuzana
Publisher:: Faculty of Informatics, Masaryk University
Type:: text, wordList, and lexicalConceptualResource
Subject:: multiword expressions
Language:: Czech
Description:: The dataset contains 4731 frozen continuous Czech multiword expressions. Inflectional word forms are generated for those MWEs where applicable. In total, the dataset contains 24,807 MWE forms.
Rights:: Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB

45. Czech OOV Inflection Dataset

Creator:: Sourada, Tomáš
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, computationalLexicon, and lexicalConceptualResource
Subject:: morphological generation, morphology, neologisms database, and Czech
Language:: Czech
Description:: Czech OOV Inflection Dataset is a Czech inflection dataset of nouns, focused on evaluation in out-of-vocabulary (OOV) conditions. It consists of two parts: a standard lemma-disjoint train-dev-test split of a subset of noun paradigms of existing morphological dictionary Czech MorfFlex 2.0 (files train, dev and test-MorfFlex); and small set of neologisms from Čeština 2.0, annotated for inflected forms (file test-neologisms).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

46. Czech SubLex 1.0

Creator:: Veselovská, Kateřina and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicalConceptualResource, and wordList
Subject:: subjectivity lexicon, sentiment analysis, opinion mining, and polarity clues
Language:: Czech
Description:: Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part of speech tags, polarity orientation and source information. The core of the Czech subjectivity lexicon has been gained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used parallel corpus CzEng 1.0 containing 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources automatically annotated at surface and deep layers of syntactic representation. Afterwards, the lexicon has been manually refined by an experienced annotator. and The work on this project has been supported by the GAUK 3537/2011 grant and by SVV project number 267 314.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

47. Czech translation of the EBUContentGenre thesaurus

Creator:: Ircing, Pavel
Publisher:: University of West Bohemia, Department of Cybernetics
Type:: text, lexicalConceptualResource, and thesaurus
Subject:: thesaurus, metadata annotation, and topic detection
Language:: Czech and English
Description:: The EBUContentGenre is a thesaurus containing the hierarchical description of various genres utilized in the TV broadcasting industry. This thesaurus is a part of a complex metadata specification called EBUCore intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of European Broadcasting Union, the largest professional association of broadcasters around the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and consequent development of systems for automatic cataloguing (topic/genre detection). and Technology Agency of the Czech Republic, project No. TA01011264
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

48. Czech Verbal MWEs

Creator:: Bejček, Eduard
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon, verbs, multiword expressions, forms, and lemmatization
Language:: Czech
Description:: Lexicon of Czech verbal multiword expressions (VMWEs) used in Parseme Shared Task 2017. https://typo.uni-konstanz.de/parseme/index.php/2-general/142-parseme-shared-task-on-automatic-detection-of-verbal-mwes Lexicon consists of 4785 VMWEs, categorized into four categories according to Parseme Shared Task (PST) typology: IReflV (inherently reflexive verbs), LVC (light verb constructions), ID (idiomatic expressions) and OTH (other VMWEs with other than verbal syntactic head). Verbal multiword expressions as well as deverbative variants of VMWEs were annotated during the preparation phase of PST. These data were published as http://hdl.handle.net/11372/LRT-2282. Czech part includes 14,536 VMWE occurences: 1611 ID 10000 IReflV 2923 LVC 2 OTH This lexicon was created out of Czech data. Each lexicon entry is represented by one line in the form: type lemmas frequency PoS [used form 1; used form 2; ... ] (columns are separated by tabs) where: type ... is the type of VMWE in PST typology lemmas ... are space separated lemmatized forms of all words that constitutes the VMWE frequency ... is the absolute frequency of this item in PST data PoS ... is a space separated list of parts of speech of individual words (in the same order as in "lemmas") final field contains a list of all (1 to 18) used forms found in the data (since Czech is a flective language).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

49. CzeDLex 0.5

Creator:: Mírovský, Jiří, Synková, Pavlína, Rysová, Magdaléna, and Poláková, Lucie
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon and discourse annotation
Language:: Czech
Description:: CzeDLex 0.5 is a pilot version of a lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0 (PDiT 2.0), a large corpus annotated manually with discourse relations. The most frequent entries in the lexicon (covering more than 2/3 of the discourse relations annotated in the PDiT 2.0) have been manually checked, translated to English and supplemented with additional linguistic information.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

50. CzeDLex 0.6

Creator:: Synková, Pavlína, Poláková, Lucie, Mírovský, Jiří, and Rysová, Magdaléna
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon and discourse annotation
Language:: Czech
Description:: CzeDLex 0.6 is the second development version of the lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0 (PDiT 2.0), a large corpus annotated manually with discourse relations. The most frequent entries in the lexicon (76 out of total 204 entries, covering more than 90% of the discourse relations annotated in PDiT 2.0), have been manually checked, translated to English and supplemented with additional linguistic information.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

51. CzeDLex 0.7

Creator:: Poláková, Lucie, Mírovský, Jiří, Synková, Pavlína, Kloudová, Věra, and Rysová, Magdaléna
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon and discourse annotation
Language:: Czech
Description:: CzeDLex 0.7 is the third development version of the Lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0 (PDiT 2.0) and, as a supplementary resource, the Czech part of the Prague Czech–English Dependency Treebank with discourse annotation projected from the Penn Discourse Treebank 3.0. The most frequent entries in the lexicon (131 out of total 218 entries, covering more than 95% of discourse relations annotated in PDiT 2.0), have been manually checked, translated to English and supplemented with additional linguistic information.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

52. CzeDLex 1.0

Creator:: Mírovský, Jiří, Synková, Pavlína, Poláková, Lucie, Kloudová, Věra, and Rysová, Magdaléna
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon and discourse
Language:: Czech
Description:: CzeDLex 1.0 is the first production version (the fourth development version) of the Lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from resources annotated manually with discourse relations: the Prague Discourse Treebank 2.0 (PDiT 2.0) as the primary resource, and two supplementary resources: (i) the Czech part of the Prague Czech–English Dependency Treebank with discourse annotation projected from the Penn Discourse Treebank 3.0, and (ii) a thousand sentences selected from various fiction novels and transcriptions of public speeches. All 200 entries in the lexicon have been manually checked, translated to English and supplemented with additional linguistic information.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

53. CzEngClass 0.1

Creator:: Urešová, Zdeňka, Fučíková, Eva, Hajičová, Eva, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: verbal valency, predicate argument structure, semantic roles, bilingual corpus annotation, translational equivalence, comparative syntax, and comparative semantics
Language:: English and Czech
Description:: The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/). Part of the dataset is a file reflecting annotators choices for assignment of verbs to classes.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

54. CzEngClass 0.2

Creator:: Urešová, Zdeňka, Fučíková, Eva, Hajičová, Eva, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: verbal valency, predicate argument structure, semantic roles, bilingual corpus annotation, translational equivalence, comparative syntax, and comparative semantics
Language:: English and Czech
Description:: The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/). Part of the dataset are files reflecting annotators choices and agreement for assignment of verbs to classes.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

55. CzEngClass 0.3

Creator:: Urešová, Zdeňka, Fučíková, Eva, Hajičová, Eva, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: verbal valency, predicate argument structure, semantic roles, bilingual corpus annotation, translational equivalence, comparative syntax, and comparative semantics
Language:: English and Czech
Description:: The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

56. CzEngVallex

Creator:: Urešová, Zdeňka, Fučíková, Eva, Hajič, Jan, and Šindlerová, Jana
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: verbal valency, argument structure, valency frame, lexicon, corpus annotation, translation equivalent, comparative syntax, comparative semantics, and valency annotation
Language:: English
Description:: CzEngVallex is a bilingual valency lexicon of corresponding Czech and English verbs. It connects 20835 aligned valency frame pairs (verb senses) which are translations of each other, aligning their arguments as well. The CzEngVallex serves as a powerful, real-text-based database of frame-to-frame and subsequently argument-to-argument pairs and can be used for example for machine translation applications. It uses the data from the Prague Czech-English Dependency Treebank project (PCEDT 2.0, http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4) and it also takes advantage of two existing valency lexicons: PDT-Vallex for Czech and EngVallex for English, using the same view of valency (based on the Functional Generative Description theory). The CzEngVallex is available in an XML format in the LINDAT/CLARIN repository, and also in a searchable form (see the “More Apps” tab) interlinked with PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F),EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2) and with examples from the PCEDT.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

57. Damen Conversations Lexikon

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: Neusatz und Faksimile der zehnbändigen Ausgabe (Leipzig, 1834-1838); wortgenaue Seitenkonkordanz zu der gedruckten Ausgabe; Darstellung der Gegenstandsbereiche gesellschaftlicher Konversation (speziell auf eine weibliche Zielgruppe ausgerichtet)
Rights:: Not specified

58. DanNet

Type:: lexicalConceptualResource
Language:: Danish
Rights:: Not specified

59. Das Deutsche Wörterbuch von Jacob und Wilhelm Grimm

Publisher:: Kompetenzzentrum für elektronische Erschließungs and Publikationsverfahren in den Geisteswissenschaften
Type:: lexicalConceptualResource
Language:: German
Description:: Online edition of the Grimm brothers' "Deutsche Wörterbuch" (1838). Each word shows the Grimms' etymological sources. Also available on CD-ROM
Rights:: Not specified

60. Database of Estonian Multi-word Verbs

Type:: lexicalConceptualResource
Language:: Estonian
Description:: 17 500 entries
Rights:: Not specified

61. Database of etymological references(Etymologinen viitetietokanta)

Type:: lexicalConceptualResource
Language:: Finnish
Description:: 128,000 word forms. Database
Rights:: Not specified

62. DeriNet 1.0

Creator:: Vidra, Jonáš, Žabokrtský, Zdeněk, Ševčíková, Magda, and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, wordnet, and lexicalConceptualResource
Subject:: derivation, DeriNet, lexical network, and MorfFlex
Language:: Czech
Description:: DeriNet is a lexical network which contains derivational relations in Czech modeled as an oriented graph. Nodes correspond to Czech lexemes (a lexeme is a single lemma, possibly with only a subset of its senses – homonyms may have different derivations and are thus represented by several lexemes) and edges represent derivations between them. DeriNet 1.0 contains 968,967 lexemes with 965,535 unique lemmas; connected by 715,729 derivational links. Lexemes in DeriNet 1.0 are sampled from the MorfFlex dictionary.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

63. DeriNet 1.2

Creator:: Vidra, Jonáš, Žabokrtský, Zdeněk, Ševčíková, Magda, and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, wordnet, and lexicalConceptualResource
Subject:: derivation, DeriNet, lexical network, and MorfFlex
Language:: Czech
Description:: DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes (i.e. single lemmas, possibly with only a subset of their senses), edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.2, contains 1,003,590 lexemes (sampled from the MorfFlex dictionary) with 1,001,394 unique lemmas, connected by 740,750 derivational links. Both rather technical and linguistic changes were made as compared to the previous version of the data; e.g. new version of the MorfFlex dictionary was used, derived words that contain a consonant and/or vowel alternation (e.g. boží) were connected with their base word (bůh).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

64. DeriNet 1.5

Creator:: Vidra, Jonáš, Žabokrtský, Zdeněk, Ševčíková, Magda, Kalužová, Adéla, Mediankin, Nikita, and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, wordnet, and lexicalConceptualResource
Subject:: DeriNet, derivation, derivational morphology, lexical network, and MorfFlex
Language:: Czech
Description:: DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.5, contains 1,011,965 lexemes (sampled from the MorfFlex dictionary) connected by 785,543 derivational links. Besides several rather conservative updates (such as newly identified prefix and suffix verb-to-verb derivations as well as noun-to-adjective derivations manifested by most frequent adjectival suffixes), DeriNet 1.5 is the first version that contains annotations related to compounding (compound words are distinguished by a special mark in their part-of-speech labels).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

65. DeriNet 1.6 (2018-09-24)

Creator:: Vidra, Jonáš, Kyjánek, Lukáš, Ševčíková, Magda, Žabokrtský, Zdeněk, Kalužová, Adéla, Dohnalová, Šárka, and Hudeček, Vojtěch
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, wordnet, and lexicalConceptualResource
Subject:: DeriNet, derivation, derivational morphology, lexical network, and MorfFlex
Language:: Czech
Description:: DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.6, contains 1,027,832 lexemes (sampled from the MorfFlex dictionary) connected by 803,404 derivational links. Furthermore, starting with version 1.5, DeriNet contains annotations related to compounding (compound words are distinguished by a special mark in their part-of-speech labels). Compared to version 1.5, version 1.6 was expanded by extracting potential links from dictionaries available under suitable licences, such as Wiktionary, and by enlarging the number of marked compounds.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

66. DeriNet 2.0

Creator:: Vidra, Jonáš, Žabokrtský, Zdeněk, Kyjánek, Lukáš, Ševčíková, Magda, and Dohnalová, Šárka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, wordnet, and lexicalConceptualResource
Subject:: DeriNet, derivation, derivational morphology, lexical network, and MorfFlex
Language:: Czech
Description:: DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational or compositional relations between a derived word and its base word / words. The present version, DeriNet 2.0, contains 1,027,665 lexemes (sampled from the MorfFlex dictionary) connected by 808682 derivational and 600 compositional links. Compared to previous versions, version 2.0 uses a new format and contains new types of annotations: compounding, annotation of several morphological and other categories of lexemes, identification of root morphs of 244,198 lexemes, semantic labelling of 151,005 relations using five labels and identification of 13 fictitious lexemes.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

67. DeriNet 2.1

Creator:: Vidra, Jonáš, Žabokrtský, Zdeněk, Kyjánek, Lukáš, Ševčíková, Magda, Dohnalová, Šárka, Svoboda, Emil, and Bodnár, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: wordnet, text, and lexicalConceptualResource
Subject:: DeriNet, derivation, derivational morphology, lexical network, and MorfFlex
Language:: Czech
Description:: DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent word-formational relations between a derived word and its base word / words. The present version, DeriNet 2.1, contains 1,039,012 lexemes (sampled from the MorfFlex CZ 2.0 dictionary) connected by 782,814 derivational, 50,533 orthographic variant, 1,952 compounding, 295 univerbation and 144 conversion relations. Compared to the previous version, version 2.1 contains annotations of orthographic variants, full automatically generated annotation of affix morpheme boundaries (in addition to the roots annotated in 2.0), 202 affixoid lexemes serving as bases for compounding, annotation of corpus frequency of lexemes, annotation of verbal conjugation classes and a pilot annotation of univerbation. The set of part-of-speech tags was converted to Universal POS from the Universal Dependencies project.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/3.0/

68. Deutsches Rechtswörterbuch (DRW)

Publisher:: University of Heidelberg, Heidelberger Akademie der Wissenschaften
Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: A dictionary of old legal German. Includes words up until 1800. Historisches Wörterbuch; Dokumentation von Rechtswörtern sowie Wörtern mit rechtlichen Bezügen (bis etwa 1800)
Rights:: Not specified

70. Deutsches Wörterbuch - The Free Dictionary

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: Angabe von grammatischen Informationen, Worterklärungen, typischen (syntaktischen) Verbindungen, idiomatischen Wendungen und Beispielsätzen; Möglichkeit, sich Übersetzungen des jeweiligen Wortes anzeigen zu lassen
Rights:: Not specified

71. Diccionario de neologismos on line

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: lexicalConceptualResource
Subject:: neologisms database
Language:: Spanish
Description:: Lexicographic resource containing 3.530 neologisms documented in press written in Spanish between 1989 and 2007.
Rights:: Not specified

72. Dicionario CLUVI inglés-galego

Publisher:: TALG Research Group (University of Vigo)
Type:: lexicalConceptualResource
Language:: English and Galician
Description:: Corpus-based English-Galician bilingual dictionary
Rights:: Not specified

73. DicoValence

Type:: lexicalConceptualResource
Language:: French
Description:: 3700 entries, text
Rights:: Not specified

74. Dictionary of Carelian (= Karjalan kielen sanakirja)

Publisher:: The Research Institute for the Languages of Finland
Type:: lexicalConceptualResource
Rights:: Not specified

75. Dictionary of Standard Slovenian Language

Type:: lexicalConceptualResource
Language:: Slovenian
Description:: 93.500 entries, XML-compatible
Rights:: Not specified

76. Dictionary of the Scots language

Publisher:: University of Dundee
Type:: lexicalConceptualResource
Language:: English
Description:: Historical dictionary of the Scottish language as written and spoken by lowland Scots in Scotland and Ulster from the 12th century onward. Over eighty thousand full-word entries.
Rights:: Not specified

78. Dictionnaire de l'occitan médiéval (DOM)

Creator:: Claudia, Kraus, Stempel, Wolf-Dieter, Tausend, Monika, and Peter, Renate
Publisher:: Bavarian Academy of Sciences and Humanities and Bayerische Akademie der Wissenschaften
Type:: text, lexicon, and lexicalConceptualResource
Subject:: Emil Levy, Petit Levy, Lexique Roman, DOM, Occitian language, Medieval Occitan, Occitan, Old Occitan, Old Provençal, Romance languages, dictionary, etymology, Middle Ages, troubadours, lexicography, and Supplementwörterbuch
Language:: French and Old Provençal (to 1500)
Description:: In the Middle Ages, Old Occitan (formerly "Old Provençal"), the language of the troubadours, was a literary and cultural language, the influence of which extended far beyond the frontiers of Southern France. The only comprehensive portrayal of the Old Occitan vocabulary to have appeared up to now is the "Lexique roman" by François Raynouard (6 vols., 1836–1845). It was supplemented by Emil Levy’s "Provenzalisches Supplementwörterbuch" (8 vols., 1894–1924). An updated dictionary, taking account of progress in research over the last 100 years, has been the desideratum of literary scholars, linguists, and historians ever since. Under the direction of Wolf-Dieter Stempel, the publication of a new dictionary of Old Occitan, the "Dictionnaire de l'occitan médiéval (DOM)", began in 1996. This appeared in print until 2013, directed from 2012 on by Maria Selig. Since then it has been available as an alphabetically complete digital dictionary, the "DOM en ligne". This comprises the newly written articles of the DOM together with the articles from the dictionaries of Raynouard and Levy for those parts of the alphabet not yet covered by the new work and is enriched by entries for words absent till now from Old Occitan lexicography. Its content is available for free at https://dom-en-ligne.de/dom.php
Rights:: Not specified

79. Digital Listing of Headwors in the Dictionary of Carelian (= Karjalan kielen sanakirja 1-6, 1968-2005)

Publisher:: Research Institute for the Languages of Finland
Type:: lexicalConceptualResource
Rights:: Not specified

80. Digital Listing of the Dictionary of Karelian (Karjalan kielen sanakirjan hakusanaluettelo)

Type:: lexicalConceptualResource
Description:: 94 532 words, XML (data), HTML (interface)
Rights:: Not specified

81. Duden online

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: Angabe von Rechtschreibung, Bedeutungsübersicht, Synonymen, Aussprache (Audio-Datei), Herkunft, Grammatik, typischen Verbindungen (computergeneriert) sowie Bedeutungen, Beispielen und Wendungen (zusätzlich: Angabe der Wörter, die im Alphabet vorhergehen und nachfolgen)
Rights:: Not specified

82. DWDS-Wörterbuch

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: setzt sich zusammen aus dem Deutschen Wörterbuch, dem Wörterbuch der Deutschen Gegenwartssprache (WDG) sowie dem Etymologischen Wörterbuch des Deutschen (EtymWb)
Rights:: Not specified

83. Early Irish Glossaries

Publisher:: Department of Anglo-Saxon, Norse, and Celtic at the University of Cambridge
Type:: lexicalConceptualResource
Language:: Irish
Description:: Database of three inter-related early Irish glossaries. The texts, compiled from the eighth century, comprise several thousand headwords followed by entries that can range from single word explanations to whole narratives running to several pages.
Rights:: Not specified

85. Electronic Vepsian Word List

Publisher:: Research Institute for the Languages of Finland
Type:: lexicalConceptualResource
Rights:: Not specified

86. elexiko – Online-Wörterbuch zur deutschen Gegenwartssprache

Publisher:: Institut für Deutsche Sprache
Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: Schwerpunkt: Bedeutungs-/Verwendungsbeschreibung; zusätzlich: Angabe von Orthographie, Worttrennung und grammatischen Informationen; befindet sich noch im Aufbau
Rights:: Not specified

87. elexiko-Wörterbuch (im elexiko-Portal)

Publisher:: Institut für Deutsche Sprache
Type:: lexicalConceptualResource
Language:: German
Description:: XML
Rights:: Not specified

88. English gustatory adjectives and lexical synaesthesia - data analysis

Creator:: Jurčević, Jana
Publisher:: Faculty of Humanities and Social Sciences, University of Rijeka
Type:: text, wordList, and lexicalConceptualResource
Subject:: lexical synaesthesia, metaphorical collocations, metonymy, cross-modal mapping, and embodiment
Language:: English
Description:: Data collection has been done by the means of Sketch Engine program. Data were extrapolated from the annotated English web corpus enTenTen20. Data collection and analysis has been done during the period of two months: April and May 2023. Recently, the enTenTen20 corpus has been updated to a newer version - enTenTen21. Nevertheless, the older version is still available, can be worked on and can be compared with the newer one. It has been noticed that the differences between the two versions of the English web corpus did not affect the results of this study. The only apparent difference was seen in slightly different numbers in frequency values for specific collocations. This was expected since the older version of web corpus consists of 36 billion words, while the new version counts 52 billion words. On the other hand, as noted above, these frequency deviations were not significant enough to refute the hypotheses. They have rather confirmed them once again. This study is one of the results of work on a larger scientific-research project called "Metaphorical collocations - syntagmatic relations between semantics and pragmatics". More information about the project is available on the following link: https://metakol.uniri.hr/en/opis-projekta/ The study has been financed by the Croatian science foundation. Working with the data/replicating the study: Data collected for the purposes of this study is available in CSV format. Data for each gustatory adjective (collocate) is presented in a separate CSV file. Upon opening each file, stretch the borders of every column for better visibility of data. Tables show different collocational bases (nouns) which are found in the corpus, in combination with a specific gustatory adjective, their collocate. These nouns are listed by their score number (The Mutual Information score expresses the extent to which words co-occur compared to the number of times they appear separately). Tables show what type of mapping is present in a certain collocation (e.g., intra-modal or cross-modal). Tables show what type of meaning or cognitive process is working in the background of the meaning formation (e.g., metonymic or metaphoric). For every analyzed collocation, we provided a contextualized example of its use from the corpus, along with the hyperlink where it can be found.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

89. EngVallex - English Valency Lexicon

Creator:: Cinková, Silvie, Fučíková, Eva, Šindlerová, Jana, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, computationalLexicon, and lexicalConceptualResource
Subject:: Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, and Valency
Language:: English
Description:: EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex contains links also to PropBank and Verbnet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. The EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

90. EngVallex - English Valency Lexicon 2.0

Creator:: Cinková, Silvie, Fučíková, Eva, Šindlerová, Jana, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, computationalLexicon, and lexicalConceptualResource
Subject:: Annotations, corpus, linguistic data, lexicon, lexical semantics, Monolingual, semantics, verbal valency, and valency
Language:: English
Description:: EngVallex 2.0 as a slightly updated version of EngVallex. It is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex contains links also to PropBank (English predicate-argument lexicon). The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank(s), which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. The EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT. EngVallex 2.0 is the same dataset as the EngVallex lexicon packaged with the PCEDT 3.0 corpus, but published separately under a more permissive licence, avoiding the need for LDC licence which is tied to PCEDT 3.0 as a whole.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

94. Extended Morphosyntactic Testset for Word2Vec

Creator:: Kocmi, Tom and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, other, and lexicalConceptualResource
Subject:: syntactic questions
Language:: English
Description:: We have created test set for syntactic questions presented in the paper [1] which is more general than Mikolov's [2]. Since we were interested in morphosyntactic relations, we extended only the questions of the syntactic type with exception of nationality adjectives which is already covered completely in Mikolov's test set. We constructed the pairs more or less manually, taking inspiration in the Czech side of the CzEng corpus [3], where explicit morphological annotation allows to identify various pairs of Czech words (different grades of adjectives, words and their negations, etc.). The word-aligned English words often shared the same properties. Another sources of pairs were acquired from various webpages usually written for learners of English. For example for verb tense, we relied on a freely available list of English verbs and their morphological variations. We have included 100-1000 different pairs for each question set. The questions were constructed from the pairs similarly as by Mikolov: generating all possible pairs of pairs. This leads to millions of questions, so we randomly selected 1000 instances per question set, to keep the test set in the same order of magnitude. Additionally, we decided to extend set of questions on opposites to cover not only opposites of adjectives but also of nouns and verbs.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

95. FAspell

Creator:: QasemiZadeh, Behrang
Publisher:: Behrang-QasemiZadeh
Type:: text, wordList, and lexicalConceptualResource
Subject:: spellchecking, spellchecker, and Evaluation Dataset for Automatic Spell Checking
Language:: Persian
Description:: FASpell dataset was developed for the evaluation of spell checking algorithms. It contains a set of pairs of misspelled Persian words and their corresponding corrected forms similar to the ASpell dataset used for English. The dataset consists of two parts: a) faspell_main: list of 5050 pairs collected from errors made by elementary school pupils and professional typists. b) faspell_ocr: list of 800 pairs collected from the output of a Farsi OCR system.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

96. FLEXIKON

Type:: lexicalConceptualResource
Language:: Danish
Description:: 80.000 entries, flat, tab-separated file
Rights:: Not specified

97. Gold Standard Reference Data for Multiword Expression Extraction: Czech Dependency Bigrams from the Prague Dependency Treebank

Creator:: Pecina, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicalConceptualResource, and computationalLexicon
Subject:: multiword expressions
Language:: Czech
Description:: Annotated list of dependency bigrams occurring in the PDT more than five times and having part-of-speech patterns that can possibly form a collocation. Each bigram is assigned to one of the six MWE categories by three annotators.
Rights:: Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB

98. Grammatisches Informationssystem (grammis)

Creator:: Strecker, Bruno, Schneider, Roman, and Konopka, Marek
Publisher:: Institut für Deutsche Sprache
Type:: lexicalConceptualResource
Language:: German
Description:: Web Information System on German grammar – contains e.g. a linked terminological knowledge-base, XML format
Rights:: Not specified

99. Herders Conversations-Lexikon

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: 1. Aufl. 1854-1857; disziplinübergreifende Darstellung von Gegenstandsbereichen gesellschaftlicher Konversation
Rights:: Not specified

100. INTERA Terminological Lexicon

Type:: lexicalConceptualResource
Language:: Bulgarian, English, Modern Greek (1453-), Serbian, and Slovenian
Description:: 17357 terms, XML
Rights:: Not specified

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Coverage

Show values starting with

Creator

Show values starting with

Format

Language

Show values starting with

Publisher

Show values starting with

Rights

Show values starting with

Subject

Show values starting with

Type

Date

Original context has metadata only

Harvested from