Original context has metadata only: true / Rights: Not specified - LINDAT/CLARIAH-CZ Catalog Search Results

Start Over Rights Not specified Original context has metadata only true

1. A Gold Standard Word Alignment for English-Swedish

Creator:: Ahrenberg, Lars and Holmqvist, Maria
Publisher:: Linköping University
Type:: text, wordList, and lexicalConceptualResource
Subject:: word alignment
Language:: Swedish and English
Description:: A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word aligned sentences pairs from English and Swedish versions of Europarl v. 2. The data can be found here: https://www.ida.liu.se/labs/nlplab/ges/
Rights:: Not specified

2. A simplified front-end for SemTi-Kamols morphological analyser

Publisher:: Institute of Mathematics and Computer Science, University of Latvia
Type:: toolService
Subject:: morphological analyzer
Language:: Latvian
Description:: A simplified front-end (in a form of a RESTful web service) of the SemTi-Kamols morphological analyzer. Mainly for demonstration purposes.
Rights:: Not specified

3. ABC - Language Identifier

Publisher:: Research Institute for Artificial Intelligence, Romanian Academy of Sciences
Type:: toolService
Description:: The application, developed in C#, automatically identifies the language of a text written in one of the 21 European Union languages. By using training texts in different languages (approx. 1.5Mb of text for each language), a training module counts the prefixes (the first 3 characters) and the suffixes (4 characters endings) for all the words in the texts, for each language. For every language two models are constructed, containing the weights (percentages) of prefixes and suffixes in the texts representing a language. In the prediction phase, for a new text, two models are built on the fly in a similar manner. These models are then compared with the stored models representing each language for which the application was trained. Using comparison functions, the best model is chose. More detailed descriptions are available in [[http://www.racai.ro/~tufis/papers|the following papers]]: -- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2008). RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference - LREC 2008, Marrakech, Morocco, May 2008. ELRA - European Language Resources Association. ISBN 2-9517408-4-0. -- Dan Tufiş and Alexandru Ceauşu (2007). Diacritics Restoration in Romanian Texts. In Elena Paskaleva and Milena Slavcheva (eds.), A Common Natural Language Processing Paradigm for Balkan Languages - RANLP 2007 Workshop Proceedings, pp. 49-56, Borovets, Bulgaria, September 2007. INCOMA Ltd., Shoumen, Bulgaria. ISBN 978-954-91743-8-0. -- Dan Tufiş and Adrian Chiţu (1999). Automatic Insertion of Diacritics in Romanian Texts. In Ferenc Kiefer, Gábor Kiss, and Júlia Pajzs (eds.), Proceedings of the 5th International Workshop on Computational Lexicography (COMPLEX 1999), pp. 185-194, Pecs, Hungary, May 1999. Linguistics Institute, Hungarian Academy of Sciences.
Rights:: Not specified

4. Access rights Management System

Publisher:: Max Planck Institute for Psycholinguistics
Type:: toolService
Description:: A tool to grant and deny the access to (parts of) an IMDI-based corpus. Support for advanced settings like ACLs.
Rights:: Not specified

5. Álgu – Origins of Saami Words

Publisher:: The Research Institute for the Languages of Finland
Type:: lexicalConceptualResource
Language:: Northern Sami
Description:: The database will contain an etymological lexicon of Saami languages complete with detailed source citations. The database will be open to the public in November 2006 and will be updated regularly.
Rights:: Not specified

6. Álgu – Origins of Saami Words (Álgu – Saamen sanojen etymologinen tietokanta)

Type:: lexicalConceptualResource
Description:: 70,000 words, over 100,000 etymological relations, Relational database
Rights:: Not specified

7. Alpino Treebank

Publisher:: Center for Language and Cognition
Format:: application/xml
Type:: corpus
Language:: Dutch
Description:: A database of 7.000 syntactically analyzed Dutch sentences.
Rights:: Not specified

8. ALTWEB

Type:: corpus
Language:: Italian
Description:: Dialect (Tuscan); 380.000 entries; written; DBT tagset
Rights:: Not specified

9. Amara - universal subtitles

Type:: corpus
Language:: Arabic, Danish, Dutch, English, German, Modern Greek (1453-), Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish
Description:: Large set of subtitles available for download in multiple languages. Can be used as parallel corpus.
Rights:: Not specified

10. Anglos-Saxon charters

Publisher:: King's College London
Format:: application/tei+xml
Type:: corpus
Language:: English
Description:: Charters written in Anglo-Saxon England before A.D. 900, marked-up in TEI XML. Browsable online.
Rights:: Not specified

11. Annex - Annotation Exploration tool

Publisher:: Max Planck Institute for Psycholinguistics
Type:: toolService
Description:: tool in the MPI web-based framework for archive exploration (and enrichment)
Rights:: Not specified

12. ANNIS

Publisher:: University of Potsdam, Dept. of Linguistics and Humboldt-University Berlin, Institut für deutsche Sprache und Linguistik
Type:: toolService
Description:: ANNIS2 is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, has been designed to provide access to the data of the SFB 632 - "Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts". Since information structure interacts with linguistic phenomena on many levels, ANNIS2 addresses the SFB's need to concurrently annotate, query and visualize data from such varied areas as syntax, semantics, morphology, prosody, referentiality, lexis and more. For project working with spoken language, support for audio / video annotations is also required.
Rights:: Not specified

13. Anotatornia

Publisher:: Institute of Computer Science, Polish Academy of Sciences
Type:: toolService
Description:: Tool for manual on-line annotation of corpora at various linguistic levels. The levels currently implemented are: word-level and sentence-level segmentation, morphosyntax, word sense disambiguation. Anotatornia implements sophisticated mechanisms of the management of texts, annotators and conflicts.
Rights:: Not specified

14. Apertium Old Catalan morphological analyzer

Publisher:: Universidad de Alicante
Type:: toolService
Subject:: morphological analyzer
Language:: Catalan
Description:: A RESTful morphological analyzer for Old Catalan.
Rights:: Not specified

15. Aquén - Toponimia galega

Publisher:: TALG Research Group (University of Vigo)
Type:: lexicalConceptualResource
Language:: Galician
Description:: Galician Toponymy Database, 40,000 entries
Rights:: Not specified

16. Arabic ACL corpus

Creator:: Salah Elfahal Elebaed, Hoyam, Kasbi, Mohammed, Nasri, Mohammed, and Bouzoubaa, Karim
Publisher:: International Journal of Computer Science Trends and Technology (IJCST)
Type:: text and corpus
Subject:: Controlled Natural Language, Arabic CNL, ACL, Arabic Corpus, and and TEI.
Language:: Arabic
Description:: This corpus constitutes all sentences representing the Arabic Controlled Language (ACL). It contains 551 sentences taken from four textbooks and websites dedicated to teach Arabic language to kids such as: a) First grade book, Republic of Sudan (كتاب الصف الاول جمهورية السودان), b) Al Jazeera Educational Site (موقع الجزيرة التعليمي), c) Bella Preparatory School Girls Forum (منتدى مدرسة بيلا الاعدادية بنات), and d) Albahr website (موقع انا البحر). These sentences are respecting 52 ACL rules. The average number of sentences for each rule is 10.6. All sentences in the corpus were analyzed by Farasa syntactic parser to confirm they are correctly analyzed. The validity of the parsing was done manually by linguist experts. The structure of this corpus is made of a header and a body. The header consists of a set of metadata that describe the corpus, such as the corpus name, the authors, the sources and further meta data. While the header is made of metadata, the body contains rules. Each rule has a code, a structure and all sentences respecting that rule. For each sentence, we store an id, the vowelledand unvowelled text as well as the result of parsing using Farasa.
Rights:: Not specified

17. Araucaria

Publisher:: School of Computing, University of Dundee
Type:: toolService
Subject:: argument analyzer
Description:: Araucaria is a software tool for analysing arguments. It aids a user in reconstructing and diagramming an argument using a simple point-and-click interface. The software also supports argumentation schemes, and provides a user-customisable set of schemes with which to analyse arguments. Written in Java, released under the GNU General Public License.
Rights:: Not specified

18. Arborest

Type:: corpus
Language:: Estonian
Description:: 149 sentences, VISL tagset
Rights:: Not specified

19. Arts and Humanities Data Service Literature, Languages and Linguistics

Type:: corpus
Language:: English
Description:: Electronic texts, corpora, lexicons. other
Rights:: Not specified

20. Assigning lemmas and part-of-speech to wordform lists

Type:: toolService
Language:: Slovenian
Description:: online service
Rights:: Not specified

21. Atlas of Place Names

Publisher:: The Research Institute for the Languages of Finland
Type:: toolService
Language:: Finnish
Description:: The digital atlas illustrates the distribution of 234 common Finnish place-name elements based on data in the Names Archive.
Rights:: Not specified

22. Audio and video database of Latvian folklore

Publisher:: Archives of Latvian Folklore, Institute of Literature, Folklore and Art, University of Latvia
Format:: application/octet-stream
Type:: corpus
Language:: Latvian
Description:: The database contains audio and video material related to traditional culture - songs, folktales, legends, life stories and various collective or individual folklore related performances. The content has been either specifically contributed to the Archives of Latvian Folklore or collected by its staff members.
Rights:: Not specified

23. Audio Recordings Archive

Publisher:: The Research Institute for the Languages of Finland
Type:: corpus
Language:: Finnish
Description:: The Audio Recordings Archive (Suomen kielen nauhoitearkisto) holds over 23,000 hours of recordings collected since 1959, providing authentic samples of Finnish dialects, languages related to Finnish, and other world languages. The collection additionally includes samples of Finnish dialects spoken in Sweden, Norway, Ingria, the United States and Australia. Digitisation of the audio bank was undertaken in 1999. Over half of its content has been digitised, totalling about 13,000 hours of recordings.
Rights:: Not specified

24. BABEL Estonian Database

Publisher:: Institute of Cybernetics at Tallinn University of Technology
Type:: corpus
Language:: Estonian
Description:: The database consists of three sets: - Many Talker Set: 30 males, 30 females; each to read 50 numbers, 1-2 connected passages, 1 block of "filler" sentences, and 1 block of syllables. - Few Talker Set: 4 males, 4 females; each to read 50 numbers, 10 connected passages, 1 block of "filler" sentences, and 2-3 blocks of syllables. - Very Few Talker Set: 1 male, 1 female; each to read 2 blocks of 50 numbers, 40 connected passages, 4 blocks of "filler" sentences, and 9 blocks of syllables. Total amount ca 12 hours of speech.
Rights:: Not specified

25. Banco de neologismos 2004-2007

Publisher:: Instituto Cervantes and Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: lexicalConceptualResource
Subject:: neologisms database
Language:: Catalan
Description:: Repository of neologisms (15.375 entries)
Rights:: Not specified

26. Base de synonymes CRISCO

Type:: lexicalConceptualResource
Language:: French
Description:: 49.000, RDB
Rights:: Not specified

27. Basic vocabulary on the Human Genome

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: lexicalConceptualResource
Language:: Catalan, English, French, Galician, Italian, Portuguese, and Spanish
Description:: A vocabulary resulting from the cooperation of the groups of REALITER network that collects the basic terminology mostly used in texts about Genomics. It contains equivalents in English, Peninsular and Latinamerican Spanish, French, Italian, Galician, Portuguese and Catalan.
Rights:: Not specified

28. Bavaria's Dialects Online

Creator:: Raaf, Manuel
Publisher:: Bayerische Akademie der Wissenschaften and Bavarian Academy of Sciences and Humanities
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: dictionary, web dictionary, Dialektologie, dialect variation, language variation, Dialectology, dialectology, Bavarian, Bavaria, Swabian, Frankish, Franconian Language, and spoken language
Language:: German, Bavarian, Swabian, and Frankish
Description:: Bavaria's Dialects Online (BDO) is the digital language information system of the three projects "Bavarian Dictionary", "Franconian Dictionary", and "Dialectological Information System of Bavarian Swabia". The database combines the research results of dialect research and presents dictionary articles as well as research data in a freely accessible online tool. BDO is not only aimed at scholars, but also at the lay public interested in the language. Here, the vocabulary of all Bavarian dialects is collected in one place and made accessible. The system shows the richness of the dialects of Bavaria in combination. With the new database, one will be able to compare the dialect vocabulary of Old Bavaria, Franconia and Swabia. Authentic dialect evidence is used to illustrate the dialect words in their variety of meanings and regional distribution, as well as to show their use in idioms, proverbs, and much more. BDO allows a whole new look at the vocabulary of the dialects of all parts of the state of Bavaria.
Rights:: Not specified

29. Berliner Wendekorpus

Publisher:: Berlin-Brandenburg Academy of Sciences and Humanities
Format:: application/tei+xml
Type:: corpus
Language:: German
Description:: Transcribed narrative interviews with people from East and West Berlin about the events of November 9. 282,000 tokens. TEI XML, lemma and POS. Normalized version also available.
Rights:: Not specified

30. Bibliografie zur deutschen Grammatik (BDG)

Publisher:: Institut für Deutsche Sprache
Type:: toolService
Language:: German
Description:: Online Bibliography, bibliographic database
Rights:: Not specified

31. Bibliotheca Augustana / Bibliotheca Germanica

Publisher:: Hochschule Augsburg
Type:: corpus
Subject:: Germanistik
Language:: German
Description:: Chronology of German literature (Old High German literature, Middle High German literature, Early New High German literature, New High German literature); Chronologie der deutschen Literatur (alt-, mittel-, frühneu-, neuhochdeutsche Literatur)
Rights:: Not specified

32. Bilder-Conversations-Lexikon

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: digitale Ausgabe der ersten Auflage des "Bilder-Conversations-Lexikons für das deutsche Volk" (1837-1841); "Handbuch zur Verbreitung gemeinnütziger Kenntnisse und zur Unterhaltung" (Selbstbeschreibung im Vorwort); beinhaltet zahlreiche Abbildungen und Landkarten
Rights:: Not specified

33. Bilingual English-Lithuanian, Lithuanian-English, Czech-Lithuanian, Lithuanian-Czech corpora

Publisher:: Center of Computational Linguistics, Vytautas Magnus University
Format:: application/xml
Type:: corpus
Language:: Czech, English, and Lithuanian
Description:: A collection of parallel corpora: English-Lithuanian (2m words), Lithuanian-English (0,06m words), Czech-Lithuanian (0,8m words), Lithuanian-Czech (0,02m words). All the corpora are online-searcheable via one interface at http://donelaitis.vdu.lt/main_en.php?id=4&nr=1_2. The corpus is still being updated with new texts.
Rights:: Not specified

34. BitPar

Creator:: Schmid, Helmut
Publisher:: University of Stuttgart
Type:: toolService
Subject:: parser
Description:: Statistical parser
Rights:: Not specified

35. Blingual Language Acquisition Julka Corpus

Publisher:: Max Planck Institute for Psycholinguistics
Type:: corpus
Language:: German and Polish
Description:: Language Acquisition corpus
Rights:: Not specified

36. BNF Converter

Publisher:: Språkbanken, Dept. of Swedish Language, Göteborg University
Type:: toolService
Subject:: compiler construction and grammar
Description:: The BNF Converter is a compiler construction tool generating a compiler front-end from a Labelled BNF grammar.
Rights:: Not specified

37. Bochumer Mittelhochdeutsch-Korpus

Publisher:: Ruhr-Universität Bochum
Type:: corpus
Subject:: Germanistik
Language:: German
Description:: Verses, prose and certificates from Middle High German; mittelhochdeutsche Verse, Prosastücke und Urkunden
Rights:: Not specified

38. Bokmålsordboka

Publisher:: Department of Linguistics and Nordic Studies, University of Oslo
Type:: lexicalConceptualResource
Description:: 65 000 entries with definitions, etymology, examples
Rights:: Not specified

39. Bonner Frühneuhochdeutschkorpus (FnhdC)

Publisher:: Korpora.org and Fakultät Geisteswissenschaften, Universität Duisburg-Essen
Type:: corpus
Subject:: Germanistik
Language:: German
Description:: Digital, morphologically annotated (N, V, A) part of the Bonn Corpus of Early New High German; used to create the Grammatik des Frühneuhochdeutschen (III. Nouns; IV. Verbs; VI. Adjectives); morphologisch annotiert; Materialgrundlage für die Erarbeitung der Bände 3, 4 und 6 der "Grammatik des Frühneuhochdeutschen"
Rights:: Not specified

40. Botanicus Digital Library

Type:: corpus
Subject:: Germanistik
Language:: Chinese, Czech, English, French, German, Latin, and Spanish
Description:: Digital copies of historical botanic papers from the Missouri Botanical Garden Library; Bilddigitalisate von historischen botanischen Schriften; deutschsprachige Texte stellen nur einen Teilbereich dar
Rights:: Not specified

41. British academic spoken English (BASE) corpus

Publisher:: Coventry University, University of Reading, University of Warwick
Format:: application/tei+xml
Type:: corpus
Language:: English
Description:: Transcribed recordings of 160 lectures and 39 seminars held in university departments. Four broad disciplinary groups, 1,644,942 tokens in total.
Rights:: Not specified

42. British National Corpus

Type:: corpus
Language:: English
Description:: General reference corpus; 100 million words; POS, lemma, descriptive metadata
Rights:: Not specified

43. Brockhaus' Kleines Konversations-Lexikon

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: 5. Aufl. 1911; Fokus auf Politik, Wirtschaft, Kultur und Technik zu Beginn des 20. Jahrhunderts
Rights:: Not specified

44. Budapest Sociolinguistic Interview (BSI)

Publisher:: Academy of Sciences
Type:: corpus
Language:: Hungarian
Description:: BSI is a large-scale survey which provides reliable data on and analyses of the varieties of Hungarian spoken in Budapest.
Rights:: Not specified

45. Bulgarian CLEF Corpus

Type:: corpus
Language:: Bulgarian
Description:: Written, synchronic, general (newspapers)
Rights:: Not specified

46. Bulgarian-Croatian Comparable Corpus

Type:: corpus
Language:: Bulgarian and Croatian
Description:: written; domain-specific (newspaper); diachronic; bilingual; comparable; ca 3,500,000 tokens (393 Kw Bulgarian; 3.1 Mw Croatian)
Rights:: Not specified

47. BulTreeBank

Type:: corpus
Language:: Bulgarian
Description:: HPSG-based annotation including: constituent structure, dependency relations, named entities (classified as person, organisation, location or other names), coreferential relations. Annotation in XML
Rights:: Not specified

48. BulTreeBank Frequency List

Type:: lexicalConceptualResource
Language:: Bulgarian
Description:: 100 000 most frequent Cyrillic tokens in the BulTreeBank text archive, UTF-16 list of token-frequency pairs
Rights:: Not specified

49. BulTreeBank Morphological Analyzer

Creator:: Simov, Kiril and Osenova, Petya
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: It is used morphological lexicon of Bulgarian (100 000 lemmas) compiled as a finite-state automaton in CLaRK System. It requires the text to be first tokenized and it is applied in each token. Includes also guessers for unknown words and Named Entities gazetteers. If the corresponding resources are available for a different language, then it can be tuned to it.
Rights:: Not specified

50. BulTreeBank Morphosyntactic Corpus

Type:: corpus
Language:: Bulgarian
Description:: Written, synchronic, general, manually annotated, 1 000 000 tokens divided in three sets: 215 000 tokens used in BulTreeBank HPSG Treebank (see below), additionally 300 000 checked second time, rest about 480 000 checked by the annotators. Morphosyntactic annotation with the BulTreeBank Tagset (http://www.bultreebank.org/TechRep/BTB-TR03.pdf), XML, annotation description in technical reports of BulTreeBank project http://www.bultreebank.org/TechRep
Rights:: Not specified

51. BulTreeBank Morphosyntactic Disambiguator

Creator:: Simov, Kiril, Osenova, Petya, and Simov, Alex
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: This is a hybrid system: rules, neural network, rules. First rules for the sure cases are applied, then a neural network disambiguator is applied, then rules for repairing of the most frequent errors of the neural network. The rules are implemented as constraints in CLaRK System. The neural network is additional module implemented in Java. It is called CLaRK. It requires the morphologically annotated input.
Rights:: Not specified

52. BulTreeBank POS Corpus

Type:: corpus
Language:: Bulgarian
Description:: Written, synchronic, general, manually annotated; 50 000 tokens, 2600 sentences extracted from the BulTreeBank Text Archive in order to contain the most frequent ambiguity classes in Bulgarian
Rights:: Not specified

53. BulTreeBank Stopword List

Type:: lexicalConceptualResource
Language:: Bulgarian
Description:: 805 prepositions, pronouns, etc stop words, UTF-16 list of wordforms
Rights:: Not specified

54. BulTreeBank Text Archive

Type:: corpus
Language:: Bulgarian
Description:: 72 000 000 tokens, 15% fiction, 78% newspapers and 7% legal texts, government bulletins and others
Rights:: Not specified

55. BulTreeBank Tokenizer

Creator:: Simov, Kiril
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: The tokenizer is covering all languages that use Latin1, Laitn2, Latin3 and Cyrillic tables of Unicode. Can be extended to cover other tables in Unicode if necessary. The implementation is as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories. It is easy to be adapted to new token categories.
Rights:: Not specified

56. BUSCANEO

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Tool for neologism extraction.
Rights:: Not specified

57. Bústia Neològica Escolar

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Terminology management
Rights:: Not specified

58. Bwananet

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan, English, and Spanish
Description:: Tool for querying the Technical Corpus of the Institut Universitari de Lingüística Aplicada.
Rights:: Not specified

59. calcular_p_cue_class

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Statistical analysis service: It calculates P(cue|class): probability of seeing a linguistic cue given a lexical class. This probability is computed given the occurrences of cues in a corpus (codified in the signatures file) and the information of belonging or not belonging of these words to different classes (codified in indicators file). The probability is computed for each studied cue in the signatures file and for each class in the indicators file.
Rights:: Not specified

60. canoonet – Deutsche Wörterbücher und Grammatik

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: Angabe von orthographischen, morphologischen (Wortformenbildung und Wortbildung) sowie semantischen Informationen (Synonymie; Hyperonymie/Hyponymie); Zuordnung der Wörter zu der jeweiligen syntaktischen Kategorie (bei Substantiven zusätzlich Angabe des Genus)
Rights:: Not specified

61. Carib (Karinya) corpus

Publisher:: Royal Institute of Linguistics and Anthropology, Leiden, The Netherlands
Format:: audio/x-wav
Type:: corpus
Description:: The data on the Carib language is collected by dr. Berend Hoff in the period 1955-1965. See: B.J. Hoff, The Carib Language, Phonology, Morphology, Text and Word Index. Verhandelingen van het Koninklijk Instituut voor Taal-, Land-, en Volkenkunde (Royal Institute of Linguistics and Anthropology) Vol. 55 (1968), Martinus Nijhoff: The Hague.
Rights:: Not specified

62. Carl-Maria-von-Weber-Gesamtausgabe (WeGA)

Publisher:: Musikwissenschaftliches Seminar Detmold/Paderborn and Staatsbibliothek zu Berlin (Musikabteilung)
Type:: corpus
Subject:: Germanistik
Language:: German
Description:: Weber's diary entries, letters, writings, and opera; Tagebücher, Briefe, Schriften und Werke Webers
Rights:: Not specified

63. CAST corpus (Computer-Aided Summarisation Tool)

Publisher:: Research Group in Computational Linguistics, University of Wolverhampton
Type:: corpus
Language:: English
Description:: Sentences annotated for important units of text for summarisation. 145,473 words / 6584 sentences
Rights:: Not specified

64. Catalan Annotated Corpora CQP

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: This RESTful service allows to define a sub-corpus from different annotated corpora. The service includes a POS tag harmonisation process where original tags are converted to EAGLES/Parole format. The eventual sub-corpus is indexed using the IMS CWB tool. The user receives an ID which can be used by the CQP service to exploit the sub-corpus.
Rights:: Not specified

65. Catalan Digital Press

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: This RESTful service accesses part of the Hemeroteca Digital de l’Arxiu Municipal de Girona (digital press archive from the Girona city council), specifically Catalan press from 2003. The service uses the SRU protocol.
Rights:: Not specified

66. catdoc

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Format conversion service: Word .doc to .txt converter
Rights:: Not specified

67. CELEX (web version)

Publisher:: Max Planck Institute for Psycholinguistics
Type:: lexicalConceptualResource
Language:: Dutch, English, and German
Rights:: Not specified

68. CELT Corpus of Electronic Texts

Publisher:: University College, Cork
Format:: application/tei+xml
Type:: corpus
Language:: English, Irish, and Latin
Description:: searchable online corpus of multilingual texts of Irish literature and history
Rights:: Not specified

69. Cercador NEOROM

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Search engine for the neologisms database of the NEOROM network. The network collects neologisms used in the press written in Romance languages from 2005 onwards.
Rights:: Not specified

70. Cercador OBNEO

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Search engine of the BOBNEO data bank, a database of neologisms present in the mass media in Spanish and Catalan, written and oral, from 1992.
Rights:: Not specified

71. Česílko 2.0 Shallow Transfer RBMT framework (opensource version)

Creator:: Vičič, Jernej, Kuboň, Vladislav, and Homola, Petr
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: Shallow Parse, Shallow Transfer Rule-Based Machine Translation, stochastic ranker, related languages, and toolbox
Description:: The system Česílko (language data and software tools) was first developed as an answer to a growing need of translation and localisation from one source language to many target languages. The starting system belonged to the Shallow Parse, Shallow Transfer Rule-Based Machine Translation – (RBMT) paradigm and it was designed primarily for translation of related languages. The latest implementation of the system uses a stochastic ranker; so technically it belongs to the hybrid machine translation paradigm, using stochastic methods combined with the traditional Shallow Transfer RBMT methods. The system has been stripped of the accompanying language resources due to copyright restrictions. The data that is available is just for demonstrative purposes.
Rights:: Not specified

72. Cesilko Web Service for Weblicht

Creator:: Hajič, Jan, Kuboň, Vladislav, and Homola, Petr
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and service
Subject:: machine translation
Description:: Weblicht integration of Cesilko (http://hdl.handle.net/11858/00-097C-0000-0006-AAFE-A)
Rights:: Not specified

73. CLaRK System - an XML-based system for Corpora Development

Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Subject:: corpus development
Description:: The CLaRK System incorporates several technologies: - XML technology - Unicode - Cascaded Regular Grammars; - Constraints over XML Documents On the basis of these technologies the following tools are implemented: XML Editor, Unicode Tokeniser, Sorting tool, Removing and Extracting tool, Concordancer, XSLT tool, Cascaded Regular Grammar tool, etc. 1 Unicode tokenization In order to provide possibility for imposing constraints over the textual node and to segment them in meaningful way, the CLaRK System supports a user-defined hierarchy of tokenisers. At the very basic level the user can define a tokeniser in terms of a set of token types. In this basic tokeniser each token type is defined by a set of UNICODE symbols. Above this basic level tokenisers, the user can define other tokenisers, for which the token types are defined as regular expressions over the tokens of some other tokeniser, the so called parent tokeniser. 2 Regular Grammars The regular grammars are the basic mechanism for linguistic processing of the content of an XML document within the system. The regular grammar processor applies a set of rules over the content of some elements in the document and incorporates the categories of the rules back in the document as XML mark-up. The content is processed before the application of the grammar rules in the following way: textual nodes are tokenized with respect to some appropriate tokeniser, the element nodes are textualized on the basis of XPath expressions that determine the important information about the element. The recognized word is substituted by a new XML mark-up, which can or can not contain the word. 3 Constraints The constraints that we implemented in the CLaRK System are generally based on the XPath language. We use XPath expressions to determine some data within one or several XML documents and thus we evaluate some predicates over the data. There are two modes of using a constraint. In the first mode the constraint is used for validity check, similar to the validity check, which is based on DTD or XML schema. In the second mode, the constraint is used to support the change of the document in order it to satisfy the constraint. There are three types of constraints, implemented in the system: regular expression constraints, number restriction constraints, value restriction constraints. 4 Macro Language In the CLaRK System the tools support a mechanism for describing their settings. On the basis of these descriptions (called queries) a tool can be applied only by pointing to a certain description record. Each query contains the states of all settings and options which the corresponding tool has. Once having this kind of queries there is a special tool for combining and applying them in groups (macros). During application the queries are executed successively and the result from an application is an input for the next one. For a better control on the process of applying several queries in one we introduce several conditional operators. These operators can determine the next query for application depending on certain conditions. When a condition for such an operator is satisfied, the execution continues from a location defined in the operator. The mechanism for addressing queries is based on user defined labels. When a condition is not satisfied the operator is ignored and the process continues from the position following the operator. In this way constructions like IF-THEN-ELSE and WHILE-DO easily can be expressed. The system supports five types of control operators: IF (XPath): the condition is an XPath expression which is evaluated on the current working document. If the result is a non-empty node-set, non-empty string, positive number or true boolean value the condition is satisfied; IF NOT (XPath): the same kind of condition as the previous one but the approving result is negated; IF CHANGED: the condition is satisfied if the preceding operation has changed the current working document or has produced a non-empty result document (depending on the operation); IF NOT CHANGED: the condition is satisfied if either the previous operation did not change the working document or did not produce a non-empty result. GOTO: unconditional changing the execution position. Each macro defined in the system can have its own query and can be incorporated in another macro. In this way some limited form of subroutine can be implemented. The new version of CLaRK will support server applications, calls to/from external programs.
Rights:: Not specified

74. CLaRK System - XML-based system for Corpora Development

Creator:: Simov, Kiril, Simov, Alex, and Kouylekov, Milen
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: The CLaRK System incorporates several technologies: - XML technology - Unicode - Cascaded Regular Grammars; - Constraints over XML Documents On the basis of these technologies the following tools are implemented: XML Editor, Unicode Tokeniser, Sorting tool, Removing and Extracting tool, Concordancer, XSLT tool, Cascaded Regular Grammar tool, etc. 1 Unicode tokenization In order to provide possibility for imposing constraints over the textual node and to segment them in meaningful way, the CLaRK System supports a user-defined hierarchy of tokenisers. At the very basic level the user can define a tokeniser in terms of a set of token types. In this basic tokeniser each token type is defined by a set of UNICODE symbols. Above this basic level tokenisers, the user can define other tokenisers, for which the token types are defined as regular expressions over the tokens of some other tokeniser, the so called parent tokeniser. 2 Regular Grammars The regular grammars are the basic mechanism for linguistic processing of the content of an XML document within the system. The regular grammar processor applies a set of rules over the content of some elements in the document and incorporates the categories of the rules back in the document as XML mark-up. The content is processed before the application of the grammar rules in the following way: textual nodes are tokenized with respect to some appropriate tokeniser, the element nodes are textualized on the basis of XPath expressions that determine the important information about the element. The recognized word is substituted by a new XML mark-up, which can or can not contain the word. 3 Constraints The constraints that we implemented in the CLaRK System are generally based on the XPath language. We use XPath expressions to determine some data within one or several XML documents and thus we evaluate some predicates over the data. There are two modes of using a constraint. In the first mode the constraint is used for validity check, similar to the validity check, which is based on DTD or XML schema. In the second mode, the constraint is used to support the change of the document in order it to satisfy the constraint. There are three types of constraints, implemented in the system: regular expression constraints, number restriction constraints, value restriction constraints. 4 Macro Language In the CLaRK System the tools support a mechanism for describing their settings. On the basis of these descriptions (called queries) a tool can be applied only by pointing to a certain description record. Each query contains the states of all settings and options which the corresponding tool has. Once having this kind of queries there is a special tool for combining and applying them in groups (macros). During application the queries are executed successively and the result from an application is an input for the next one. For a better control on the process of applying several queries in one we introduce several conditional operators. These operators can determine the next query for application depending on certain conditions. When a condition for such an operator is satisfied, the execution continues from a location defined in the operator. The mechanism for addressing queries is based on user defined labels. When a condition is not satisfied the operator is ignored and the process continues from the position following the operator. In this way constructions like IF-THEN-ELSE and WHILE-DO easily can be expressed. The system supports five types of control operators: IF (XPath): the condition is an XPath expression which is evaluated on the current working document. If the result is a non-empty node-set, non-empty string, positive number or true boolean value the condition is satisfied; IF NOT (XPath): the same kind of condition as the previous one but the approving result is negated; IF CHANGED: the condition is satisfied if the preceding operation has changed the current working document or has produced a non-empty result document (depending on the operation); IF NOT CHANGED: the condition is satisfied if either the previous operation did not change the working document or did not produce a non-empty result. GOTO: unconditional changing the execution position. Each macro defined in the system can have its own query and can be incorporated in another macro. In this way some limited form of subroutine can be implemented. The new version of CLaRK will support server applications, calls to/from external programs.
Rights:: Not specified

75. CLIPS : corpora e lessici di italiano parlato e scritto

Publisher:: Università degli studi di Napoli Federico II
Type:: corpus
Language:: Italian
Description:: Audio files of about 100 hours of speech from 15 different cities in Italy. Various recordings are transcribed to read in PDF
Rights:: Not specified

76. Code-switching conversation corpus

Publisher:: Max Planck Institute for Psycholinguistics
Type:: corpus
Language:: Dutch
Description:: The code-switching corpus consists of 5x30-minute conversations between four speakers (i.e. a total of 20 speakers). The speakers are bilingual speakers of Papiamento (a creole langauge spoken in the Dutch Antilles) and Dutch. In the course of their free conversations, they engage in code-switching, that is, they use both languages within the same utterance in systematic ways. The corpus is fully transcribed and glossed, coded for language and word class, in ELAN.
Rights:: Not specified

77. COLDIC

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Tool for dictionary management
Rights:: Not specified

78. Collection of Latvian literature

Publisher:: Tilde
Type:: corpus
Language:: Latvian
Description:: Masterpieces of Latvian literature from the beginning of Latvian literature until first decades of 20th century
Rights:: Not specified

79. Collection of Latvian proverbs

Publisher:: Archives of Latvian Folklore, Institute of Literature, Folklore and Art, University of Latvia and Institute of Mathematics and Computer Science, University of Latvia
Type:: corpus
Language:: Latvian
Description:: Latvian proverbs collected by Archives of Latvian Folklore (~ 20 000 items)
Rights:: Not specified

80. Cologne Digital Sanskrit Dictionaries

Publisher:: Institute of Indology and Tamil Studies, Cologne University
Type:: lexicalConceptualResource
Language:: Sanskrit
Description:: Sanskrit lexicons. The data is made available as scanned images of the works as well as a digitization of the scanned images, which permits computer-aided analyses and displays of the work. Can be downloaded or queried online.
Rights:: Not specified

81. COLT – The Bergen Corpus of London Teenage Language

Type:: corpus
Language:: English
Description:: British English (London); Spoken, general, age-specific dialect corpus; 500 000 words, 55 hrs of recording; POS, speaker/conversation metainfo
Rights:: Not specified

82. COMPARA : Portuguese - English parallel translation corpus

Type:: corpus
Language:: English and Portuguese
Description:: bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source-texts and translations. Searchable via the IMS Corpus Query Processor and the DISPARA interface
Rights:: Not specified

83. Comparable Russian-Finnish corpus of juridical texts

Publisher:: University of Tampere
Format:: application/octet-stream
Type:: corpus
Language:: Finnish and Russian
Description:: Juridical texts in Russian and Finnish arranged as a comparable text corpus
Rights:: Not specified

84. Complete Corpus of Anglo-Saxon Poetry

Format:: text/plain
Type:: corpus
Language:: English
Description:: Plain-text electronic editions of Old English poems
Rights:: Not specified

85. Complete sagas of Icelanders

Type:: corpus
Language:: English
Description:: New English translations of the entire corpus of the sagas of Icelanders and connected tales
Rights:: Not specified

86. Concise dictionary of Latvian

Publisher:: Institute of Mathematics and Computer Science, University of Latvia
Format:: application/xml
Type:: lexicalConceptualResource
Language:: Latvian
Description:: > 25 000 entries
Rights:: Not specified

87. ConsILR - Consortium for the Romanian Language: Resources & Tools

Type:: lexicalConceptualResource
Language:: English and Romanian
Description:: Resources and tools developed for Romanian
Rights:: Not specified

88. Constraint Grammar of Estonian

Type:: languageDescription
Language:: Estonian
Description:: general written, Constraint Grammar
Rights:: Not specified

89. Conversations-Lexikon oder kurzgefaßtes Handwörterbuch

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: 1. Aufl. 1809-1811; Darstellung der Gegenstandsbereiche gesellschaftlicher Konversation; Berücksichtigung bedeutender historischer Ereignisse
Rights:: Not specified

90. CORP-ORAL Spontaneous Speech Corpus

Publisher:: Instituto de Linguística Teórica e Computacional
Type:: corpus
Language:: Portuguese
Description:: The aim of the CORP-ORAL project is to build a corpus of spontaneous European Portuguese speech available for the training of speech synthesis and recognition systems as well as phonetic, phonological, lexical, morphological and syntactic studies. The corpus contains the recording of 60 hours of conversations between two European Portuguese speakers per conversation (at a time). The entire corpus will be completed with orthographic transcription and the prosodic marking of speech breaks/boundaries as well as phonetic transcription of a selection of chunks. CORP-ORAL is built from scratch with the explicit goal of becoming entirely available on the internet to the scientific community and the public in general.
Rights:: Not specified

91. Corpus "Miljons"

Publisher:: Institute of Mathematics and Computer Science, University of Latvia
Format:: text/plain
Type:: corpus
Subject:: balanced corpus
Language:: Latvian
Description:: Balanced corpus of Modern Latvian (~ 1 million running words, currently in plain-text), publicly available via Bonito interface
Rights:: Not specified

92. Corpus "Plāns ledus"

Publisher:: Institute of Mathematics and Computer Science, University of Latvia
Format:: text/sgml
Type:: corpus
Language:: Latvian
Description:: Morphologically tagged and lemmatized text sample (> 16 000 running words), publicly available via Bonito interface and http://www.korpuss.lv/uzzinas/plans_ledus.pdf
Rights:: Not specified

93. Corpus arborée du français

Type:: corpus
Language:: French
Description:: 800.000 words, POS and syntax, proprietary XML
Rights:: Not specified

94. Corpus bilingüe d’alternança de llengües (codeswitching)

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Subject:: speech corpus
Language:: Catalan, English, and Spanish
Description:: 8 interactive recordings of group dynamics. Bilingual speakers (L1 -> English; L1 -> Catalan/Spanish).
Rights:: Not specified

95. Corpus bilingüe de La Canonja en temps aparent (TA)

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Subject:: oral corpus and bilingual
Language:: Catalan and Spanish
Description:: Bilingual oral corpus containing 30 life histories/sociolinguistic interviews (17 in Catalan and 13 in Spanish) carried out in La Canonja (Tarragona).
Rights:: Not specified

96. Corpus CLUVI

Publisher:: TALG Research Group (University of Vigo)
Type:: corpus
Language:: Basque, Catalan, English, French, Galician, German, Portuguese, and Spanish
Description:: Parallel corpus, 22 million words
Rights:: Not specified

97. Corpus d’entrevistes sociolingüístiques a castellanoparlants i catalanoparlants en temps aparent (TA)

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Subject:: oral corpus
Language:: Catalan and Spanish
Description:: Oral corpus containing 15 sociolinguistic interviews in Spanish and Catalan, carried out by means of Labovian techniques.
Rights:: Not specified

98. Corpus d’extractes de gravacions d’Internet en temps aparent (TA) i temps real (TR) amb finalitats forenses

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Subject:: corpus
Language:: English
Rights:: Not specified

99. Corpus de narratives d’angloparlants immigrats a Espanya en temps aparent (TA)

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Language:: English
Description:: Oral corpus containing 166 narratives in English elicited by means of Labovian techniques. Participants from the UK (England, Wales, Scotland), Ireland, USA, Australia and South Africa.
Rights:: Not specified

100. Corpus de parlants catalanoparlants de La Canonja en temps real (TR)

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Subject:: oral corpus
Language:: Catalan
Description:: Oral corpus containing 10 sociolinguistic interviews carried out in La Canonja (Tarragona).
Rights:: Not specified

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Coverage

Show values starting with

Creator

Show values starting with

Format

Language

Show values starting with

Publisher

Show values starting with

Rights

Subject

Show values starting with

Type

Show values starting with

Date

Original context has metadata only

Harvested from