Language: English / Rights: Not specified - LINDAT/CLARIAH-CZ Catalog Search Results


1. A Gold Standard Word Alignment for English-Swedish

Creator:: Ahrenberg, Lars and Holmqvist, Maria
Publisher:: Linköping University
Type:: text, wordList, and lexicalConceptualResource
Subject:: word alignment
Language:: Swedish and English
Description:: A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word-aligned sentence pairs from the English and Swedish versions of Europarl v. 2. The data can be found here: https://www.ida.liu.se/labs/nlplab/ges/
Rights:: Not specified

2. Amara - universal subtitles

Type:: corpus
Language:: Arabic, Danish, Dutch, English, German, Modern Greek (1453-), Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish
Description:: Large set of subtitles available for download in multiple languages. Can be used as parallel corpus.
Rights:: Not specified

3. Anglo-Saxon charters

Publisher:: King's College London
Format:: application/tei+xml
Type:: corpus
Language:: English
Description:: Charters written in Anglo-Saxon England before A.D. 900, marked-up in TEI XML. Browsable online.
Rights:: Not specified

4. Arts and Humanities Data Service Literature, Languages and Linguistics

Type:: corpus
Language:: English
Description:: Electronic texts, corpora, lexicons, and other resources.
Rights:: Not specified

5. Basic vocabulary on the Human Genome

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: lexicalConceptualResource
Language:: Catalan, English, French, Galician, Italian, Portuguese, and Spanish
Description:: A vocabulary, produced in cooperation by the groups of the REALITER network, that collects the basic terminology most commonly used in texts about genomics. It contains equivalents in English, Peninsular and Latin American Spanish, French, Italian, Galician, Portuguese and Catalan.
Rights:: Not specified

6. Bilingual English-Lithuanian, Lithuanian-English, Czech-Lithuanian, Lithuanian-Czech corpora

Publisher:: Center of Computational Linguistics, Vytautas Magnus University
Format:: application/xml
Type:: corpus
Language:: Czech, English, and Lithuanian
Description:: A collection of parallel corpora: English-Lithuanian (2m words), Lithuanian-English (0.06m words), Czech-Lithuanian (0.8m words), Lithuanian-Czech (0.02m words). All the corpora are searchable online via one interface at http://donelaitis.vdu.lt/main_en.php?id=4&nr=1_2. The corpus is still being updated with new texts.
Rights:: Not specified

7. Botanicus Digital Library

Type:: corpus
Subject:: German studies
Language:: Chinese, Czech, English, French, German, Latin, and Spanish
Description:: Digital image copies of historical botanical papers from the Missouri Botanical Garden Library; German-language texts make up only part of the collection.
Rights:: Not specified

8. British academic spoken English (BASE) corpus

Publisher:: Coventry University, University of Reading, University of Warwick
Format:: application/tei+xml
Type:: corpus
Language:: English
Description:: Transcribed recordings of 160 lectures and 39 seminars held in university departments. Four broad disciplinary groups, 1,644,942 tokens in total.
Rights:: Not specified

9. British National Corpus

Type:: corpus
Language:: English
Description:: General reference corpus; 100 million words; POS, lemma, descriptive metadata
Rights:: Not specified

10. Bwananet

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan, English, and Spanish
Description:: Tool for querying the Technical Corpus of the Institut Universitari de Lingüística Aplicada.
Rights:: Not specified

11. CAST corpus (Computer-Aided Summarisation Tool)

Publisher:: Research Group in Computational Linguistics, University of Wolverhampton
Type:: corpus
Language:: English
Description:: Sentences annotated for important units of text for summarisation. 145,473 words / 6584 sentences
Rights:: Not specified

12. CELEX (web version)

Publisher:: Max Planck Institute for Psycholinguistics
Type:: lexicalConceptualResource
Language:: Dutch, English, and German
Rights:: Not specified

13. CELT Corpus of Electronic Texts

Publisher:: University College, Cork
Format:: application/tei+xml
Type:: corpus
Language:: English, Irish, and Latin
Description:: searchable online corpus of multilingual texts of Irish literature and history
Rights:: Not specified

14. COLT – The Bergen Corpus of London Teenage Language

Type:: corpus
Language:: English
Description:: British English (London); Spoken, general, age-specific dialect corpus; 500 000 words, 55 hrs of recording; POS, speaker/conversation metainfo
Rights:: Not specified

15. COMPARA : Portuguese - English parallel translation corpus

Type:: corpus
Language:: English and Portuguese
Description:: bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source-texts and translations. Searchable via the IMS Corpus Query Processor and the DISPARA interface
Rights:: Not specified

16. Complete Corpus of Anglo-Saxon Poetry

Format:: text/plain
Type:: corpus
Language:: English
Description:: Plain-text electronic editions of Old English poems
Rights:: Not specified

17. Complete sagas of Icelanders

Type:: corpus
Language:: English
Description:: New English translations of the entire corpus of the sagas of Icelanders and connected tales
Rights:: Not specified

18. ConsILR - Consortium for the Romanian Language: Resources & Tools

Type:: lexicalConceptualResource
Language:: English and Romanian
Description:: Resources and tools developed for Romanian
Rights:: Not specified

19. Corpus bilingüe d’alternança de llengües (codeswitching)

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Subject:: speech corpus
Language:: Catalan, English, and Spanish
Description:: 8 interactive recordings of group dynamics. Bilingual speakers (L1 -> English; L1 -> Catalan/Spanish).
Rights:: Not specified

20. Corpus CLUVI

Publisher:: TALG Research Group (University of Vigo)
Type:: corpus
Language:: Basque, Catalan, English, French, Galician, German, Portuguese, and Spanish
Description:: Parallel corpus, 22 million words
Rights:: Not specified

21. Corpus d’extractes de gravacions d’Internet en temps aparent (TA) i temps real (TR) amb finalitats forenses

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Subject:: corpus
Language:: English
Rights:: Not specified

22. Corpus de narratives d’angloparlants immigrats a Espanya en temps aparent (TA)

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Language:: English
Description:: Oral corpus containing 166 narratives in English elicited by means of Labovian techniques. Participants from the UK (England, Wales, Scotland), Ireland, USA, Australia and South Africa.
Rights:: Not specified

23. Corpus of Early English Correspondence Sampler (CEECS)

Publisher:: University of Helsinki
Format:: text/plain
Type:: corpus
Language:: English
Description:: Personal correspondence from England between the years 1418-1680. Compiled as a tool for historical sociolinguistics.
Rights:: Not specified

24. Corpus Tècnic de l'IULA

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: corpus
Language:: Catalan, English, and Spanish
Description:: Domain-specific corpus (Law, Economy, Computing, Medicine and Environment, as well as a contrastive corpus from the press); EN 3.3 M tokens, SP 33 M tokens, CAT 19 M tokens; EAGLES POS tagset
Rights:: Not specified

25. CorpusExplorer

Creator:: Rüdiger, Jan Oliver
Publisher:: Jan Oliver Rüdiger
Type:: tool and toolService
Subject:: Corpus Linguistics, NLP, conll, tei, XML, nlp, Natural Language Processing, linguistics, Linguistics, Computational Linguistics, corpus processing, tagger, POS tagger, lemmatization, text cleaning, CommonCrawl, epub, JSON, Twitter, Pandoc, Wikipedia, digital data, DTA, DSpin, MySQL, ElasticSearch, TextGrid, text corpora, TigerXML, and WeblichtXML
Language:: German, English, French, Italian, Dutch, Spanish, Polish, Arabic, Chinese, and Portuguese
Description:: Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 interactive visualizations under a user-friendly interface. Routine tasks such as text acquisition, cleaning or tagging are completely automated. The simple interface supports use in university teaching and leads users and students to fast and substantial results. The CorpusExplorer supports many standards (XML, CSV, JSON, R, etc.) and also offers its own software development kit (SDK). Source code available at https://github.com/notesjor/corpusexplorer2.0
Rights:: Not specified

26. Croatian-English Parallel Corpus

Publisher:: University of Zagreb, Faculty of Humanities and Social Sciences
Type:: corpus
Language:: Croatian and English
Description:: written; domain-specific (newspaper); synchronic; bilingual; parallel; unidirectional; XML; S-alignment
Rights:: Not specified

27. CST's lemmatiser

Publisher:: Center for Sprogteknologi, University of Copenhagen
Type:: toolService
Language:: Danish, Dutch, English, German, Modern Greek (1453-), Icelandic, Norwegian, Russian, Slovenian, and Swedish
Description:: 1) Fully automatic rule-based lemmatization of inflected languages 2) Fully automatic training of lemmatization rules based on a full form-lemma list
Rights:: Not specified

28. Dependency Grammars

Publisher:: Universitat de Barcelona
Type:: languageDescription
Subject:: dependency grammar
Language:: Catalan, English, and Spanish
Description:: Dependency grammars
Rights:: Not specified

29. Diachronic Corpus of Present-Day Spoken English (DCPSE)

Publisher:: Survey of English Usage, University College London
Type:: corpus
Language:: English
Description:: A parsed corpus of spoken English. Ca 400,000 words from ICE-GB (early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s). The orthographic transcriptions have been normalised and annotated.
Rights:: Not specified

30. Dicionario CLUVI inglés-galego

Publisher:: TALG Research Group (University of Vigo)
Type:: lexicalConceptualResource
Language:: English and Galician
Description:: Corpus-based English-Galician bilingual dictionary
Rights:: Not specified

31. Dictionary of the Scots language

Publisher:: University of Dundee
Type:: lexicalConceptualResource
Language:: English
Description:: Historical dictionary of the Scottish language as written and spoken by lowland Scots in Scotland and Ulster from the 12th century onward. Over eighty thousand full-word entries.
Rights:: Not specified

32. DPC (Dutch Parallel Corpus)

Publisher:: Katholieke Universiteit Leuven Campus Kortrijk, Hogeschool Gent
Type:: corpus
Language:: Dutch, English, and French
Description:: Parallel corpus, with Dutch as first language, 10 M words (under construction). DPC is a STEVIN-project.
Rights:: Not specified

33. English Gigaword

Publisher:: Linguistic Data Consortium (LDC)
Type:: corpus
Language:: English
Description:: 3rd edition contains millions of words from 6 different news wires.
Rights:: Not specified

34. English-Bulgarian INTERA

Type:: corpus
Language:: Bulgarian and English
Description:: Alignment – TMX, structural – XCES, morphosyntactic – XCES, MTE tagset
Rights:: Not specified

35. English-Latvian SMT system

Publisher:: Institute of Mathematics and Computer Science, University of Latvia
Type:: toolService
Language:: English
Description:: English-Latvian factored SMT system that uses the Moses decoder, trained on JRC-Acquis and some other parallel texts
Rights:: Not specified

36. English-Lithuanian Machine Translation Service

Publisher:: Center of Computational Linguistics, Vytautas Magnus University
Type:: toolService
Language:: English and Lithuanian
Description:: On-line freely accessible machine translation tool for translating English webpages or texts into Lithuanian.
Rights:: Not specified

37. English-Luganda Parallel Corpus

Publisher:: Center for Dutch Language and Speech, University of Antwerp
Type:: corpus
Language:: English
Description:: Bible. Word-aligned corpus
Rights:: Not specified

38. Estonian-English parallel corpus

Type:: corpus
Language:: English and Estonian
Description:: Written EU legislation; 5 million words Estonian, 7.8 million words English; sentence-aligned
Rights:: Not specified

39. Eurotermbank

Publisher:: Tilde and Eurotermbank consortium
Format:: application/octet-stream
Type:: lexicalConceptualResource
Language:: English, Estonian, French, German, Hungarian, Latvian, and Lithuanian
Description:: EuroTermBank is a single access point to European multilingual terminology resources. It contains more than 1.9 million terms over 25 languages.
Rights:: Not specified

40. EUSTACE : Edinburgh University speech timing archive and corpus of English

Publisher:: Centre for Speech Technology Research, University of Edinburgh
Type:: corpus
Language:: English
Description:: Speech corpus comprising 4608 spoken sentences recorded for speech timing research. The complete archive, available for downloading, includes a structured list of the sentences, the speech recordings and the label files, plus full documentation.
Rights:: Not specified

41. FreeLing

Publisher:: Centro de Tecnologías y Aplicaciones del Lenguaje y del Habla (TALP)
Type:: toolService
Language:: Catalan, English, Galician, Italian, Portuguese, and Welsh
Description:: Open-source language analysis tool suite: tokenizer, stemmer/lemmatizer, named entity recognizer, chunker/segmenter, morphosyntactic tagger, syntactic tagger, corpus processor, morphological tagger, semantic tagger, analyzer, word sense disambiguator.
Rights:: Not specified

42. Helsinki Corpus of British English Dialects

Publisher:: University of Helsinki
Format:: text/plain
Type:: corpus
Language:: English
Description:: Collection of orthographically transcribed audio recorded speech, mainly from East Anglia and the South-West, with a minor collection from Lancashire. The recordings were made in the 1970s and the 1980s by Finnish postgraduates.
Rights:: Not specified

43. Helsinki Corpus of English Texts (HC)

Publisher:: University of Helsinki
Format:: text/plain
Type:: corpus
Language:: English
Description:: A balanced multi-genre corpus of English texts between the years c. 730-1710.
Rights:: Not specified

44. Helsinki Corpus of Older Scots (HCOS)

Publisher:: University of Helsinki
Format:: text/plain
Type:: corpus
Language:: English
Description:: A balanced multi-genre corpus modelled on the Helsinki Corpus, covering the years 1450-1700.
Rights:: Not specified

45. ICLE International Corpus of Learner English

Publisher:: Centre for English Corpus Linguistics, Université catholique de Louvain
Type:: corpus
Language:: English
Description:: over 3 million words of writing by learners of English from 14 different mother tongue backgrounds
Rights:: Not specified

46. IJS-ELAN

Type:: corpus
Language:: English and Slovenian
Description:: parallel, mixed text; 2x0.5 mil. words; TEI / morphosyntactic tags
Rights:: Not specified

47. INTERA Terminological Lexicon

Type:: lexicalConceptualResource
Language:: Bulgarian, English, Modern Greek (1453-), Serbian, and Slovenian
Description:: 17357 terms, XML
Rights:: Not specified

48. International Corpus of English: East Africa (ICE-EA)

Publisher:: Technische Universität Chemnitz, Universität Bayreuth
Type:: corpus
Subject:: corpus
Language:: English
Description:: One million words of spoken and written English from Kenya and Tanzania. Part of the ICE project
Rights:: Not specified

49. International Corpus of English: Great Britain (ICE-GB)

Publisher:: Survey of English Usage, University College London
Type:: corpus
Language:: English
Description:: One million words of written and spoken English from Great Britain. Transcriptions aligned with digitised speech recordings. POS-tagged and parsed. Part of the International Corpus of English project. Custom-made search software: ICE-CUP
Rights:: Not specified

50. International Corpus of English: Hong Kong (ICE-HK)

Publisher:: Department of Linguistics, The University of Hong Kong
Type:: corpus
Language:: English
Description:: One million words of spoken and written Hong Kong English produced after 1989. Part of the ICE project.
Rights:: Not specified

51. International Corpus of English: India (ICE-Ind)

Publisher:: Shivaji University, Freie Universität Berlin
Type:: corpus
Language:: English
Description:: One million words of spoken and written English from India. Part of the ICE project
Rights:: Not specified

52. International Corpus of English: UK (ICE-GB)

Publisher:: Survey of English Usage, University College London
Type:: corpus
Language:: English
Description:: 1 million words of spoken and written English from the UK. POS-tagged and parsed. Digitised speech recordings aligned with text. Part of the International Corpus of English (ICE).
Rights:: Not specified

53. IViE corpus: English Intonation in the British Isles

Publisher:: Phonetics Laboratory, University of Oxford
Type:: corpus
Language:: English
Description:: 36 hours of speech recordings of nine urban varieties of UK English, collected among 16-year-olds in secondary schools. Part of the corpus has been prosodically transcribed.
Rights:: Not specified

54. JIRS

Publisher:: Grid and High Performance Computing Group, ITACA, Universidad Politécnica de Valencia and Universidad de Alicante
Type:: toolService
Language:: Arabic, English, French, Italian, Oromo, and Urdu
Description:: JIRS is a passage retrieval system specially suited for question answering. It can be adapted to other languages very easily. Task (written language): information retrieval applications, question answering. Environment: OS-independent. Access: GPLv3.
Rights:: Not specified

55. JRC-Acquis

Publisher:: Joint Research Centre of the EU
Type:: corpus
Language:: Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Modern Greek (1453-), Hungarian, Italian, Latvian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish
Description:: The largest parallel corpus; contains EU law (the Acquis Communautaire) in 22 languages.
Rights:: Not specified

56. Kacenka : parallel corpus of English and Czech texts

Publisher:: Masaryk University, Brno
Type:: corpus
Language:: Czech and English
Description:: Parallel corpus, 3,297,283 words. The idea was to create a small parallel corpus that would make it possible to work with entire texts in translation analysis rather than short extracts. At the same time, it aimed at acquiring experience that could be used in creating a larger parallel corpus of English and Czech in the future. Although the main part of the work has been completed, and the aims of the KACENKA grant met, we keep improving and enlarging KACENKA gradually. Currently, it has a size of 3,297,283 words (of which 1,689,513 have been acquired by means of scanning). Most of the English texts for KACENKA have been retrieved from Internet resources. The rest, and nearly all the Czech texts, had to be scanned with the use of an OCR programme. KACENKA is stored on a single CD-ROM; its use is limited by copyright restrictions.
Rights:: Not specified

57. KIAP - Cultural Identity in Academic Prose

Type:: corpus
Language:: English, French, and Norwegian
Description:: Comparable corpus, written, academic prose; 450 reviewed scientific papers; 3.2 million words; POS
Rights:: Not specified

58. Kicktionary

Type:: lexicalConceptualResource
Language:: English, French, and German
Description:: Electronic dictionary of football language, using FrameNet and WordNet approaches
Rights:: Not specified

59. L1 Acquisition Penelope Brown Rossel

Publisher:: Max Planck Institute for Psycholinguistics
Type:: corpus
Language:: English
Description:: Language Acquisition corpus
Rights:: Not specified

60. L2 Acquisition Barbara Schmiedtova

Publisher:: Max Planck Institute for Psycholinguistics
Type:: corpus
Language:: Czech, English, German, and Vietnamese
Description:: Language Acquisition corpus
Rights:: Not specified

61. L2 Acquisition Finiteness and Scope

Publisher:: Max Planck Institute for Psycholinguistics
Type:: corpus
Language:: Dutch, English, French, and German
Description:: Language Acquisition corpus
Rights:: Not specified

62. LAC English Corpus

Publisher:: Max Planck Institute for Psycholinguistics
Type:: corpus
Language:: English
Description:: Language and Cognition corpus
Rights:: Not specified

63. LCsum (Document Summarizer)

Publisher:: Centro de Tecnologías y Aplicaciones del Lenguaje y del Habla (TALP)
Type:: toolService
Language:: Catalan, English, and Spanish
Description:: Document summarizer.
Rights:: Not specified

64. MEBA word aligner

Creator:: Tufiş, Dan and Ceauşu, Alexandru
Publisher:: Research Institute for Artificial Intelligence, Romanian Academy of Sciences
Type:: toolService
Subject:: word aligner
Language:: English and Romanian
Description:: MEBA is a lexical aligner, implemented in C#, based on an iterative algorithm that uses pre-processing steps: sentence alignment (SAL, http://www.clarin.eu/tools/sal-sentence-aligner), tokenization, POS-tagging and lemmatization (through TTL, http://www.clarin.eu/tools/ttl-tokenizing-tagging-and-lemmatizing-free-running-texts), and sentence chunking. Similar to the YAWA aligner, MEBA generates the links step by step, beginning with the most probable (anchor links). The links to be added at any later step are supported or restricted by the links created in the previous iterations. The aligner has different weights and different significance thresholds on each feature and iteration. Each of the iterations can be configured to align different categories of tokens (named entities, dates and numbers, content words, functional words, punctuation) in decreasing order of statistical evidence. MEBA has an individual F-measure of 81.71% and is currently integrated in the COWAL platform (http://www.clarin.eu/tools/cowal-combined-word-aligner). More detailed descriptions are available in the following papers (http://www.racai.ro/~tufis/papers): (1) Dan Tufiş (2007). Exploiting Aligned Parallel Corpora in Multilingual Studies and Applications. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Intercultural Collaboration. First International Workshop (IWIC 2007), volume 4568 of Lecture Notes in Computer Science, pp. 103-117. Springer-Verlag, August 2007. ISBN 978-3-540-73999-9. (2) Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2006). Improved Lexical Alignment by Combining Multiple Reified Alignments. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Proceedings of the 11th Conference EACL2006, pp. 153-160, Trento, Italy, April 2006. Association for Computational Linguistics. ISBN 1-9324-32-61-2. (3) Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2005). Combined Aligners. In Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, pp. 107-110, Ann Arbor, USA, June 2005. Association for Computational Linguistics. ISBN 978-973-703-208-9.
Rights:: Not specified

65. Memory-Based Shallow Parser (MBSP)

Publisher:: ILK, Tilburg University and CNTS - Language Technology Group, University of Antwerp
Type:: toolService
Language:: Dutch and English
Description:: MBSP is a set of linguistic tools based on the TiMBL and MBT memory-based learning applications developed at CNTS and ILK. It provides tools for part-of-speech tagging, chunking, lemmatizing, relation finding, named entity recognition, and (for medical language) semantic tagging.
Rights:: Not specified

66. MorphoDiTa: Morphological Dictionary and Tagger

Creator:: Straka, Milan and Straková, Jana
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and tool
Subject:: tagging, morphological analysis, morphological generation, and tokenization
Language:: English
Description:: MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second. MorphoDiTa is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
Rights:: Not specified

67. Moses Web Demo

Creator:: Bojar, Ondřej, Cífka, Ondřej, Pecina, Pavel, and Tamchyna, Aleš
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and tool
Subject:: machine translation, web service, and demo
Language:: Czech, English, Russian, Ukrainian, French, and German
Description:: An interactive web demo of selected ÚFAL MT systems. and FP7-ICT-2011-7-288487 MosesCore
Rights:: Not specified

68. MPI ESF Corpus

Type:: corpus
Language:: Dutch, English, French, German, and Swedish
Description:: Corpus of the ESF Foreign Language Speakers project; almost perfect structure for IEI; completely metadata described; lots of annotated audio recordings containing multimodal interaction
Rights:: Not specified

69. Multilingual Central Repository

Publisher:: Centro de Tecnologías y Aplicaciones del Lenguaje y del Habla (TALP)
Type:: lexicalConceptualResource
Subject:: lexical database
Language:: Basque, Catalan, English, Galician, and Spanish
Description:: Multilingual lexical database that follows the model proposed by the EuroWordNet project. The MCR integrates into the same EuroWordNet framework wordnets from five different languages (together with four English WordNet versions). It also integrates WordNet Domains and new versions of the Base Concepts and Top Concept Ontology. Overall, it contains 1,642,389 semantic relations between synsets, most of them acquired by automatic means. Information contained: semantics, synonyms, antonyms, definition, equivalents, example of use, morphology.
Rights:: Not specified

70. Multilingual corpus of juridical texts

Publisher:: University of Tampere
Format:: application/octet-stream
Type:: corpus
Subject:: parallel corpus and multilingual
Language:: English, German, Russian, and Swedish
Description:: International conventions and treaties arranged as a parallel corpus aligned at paragraph level
Rights:: Not specified

71. Multilingualism Marianne Gullberg & Peter Indefrey

Publisher:: Max Planck Institute for Psycholinguistics
Type:: corpus
Language:: Dutch, German, English, and French
Description:: Language Acquisition corpus
Rights:: Not specified

72. MUSA Multilingual Multimodal Corpus

Type:: corpus
Language:: English, French, and Modern Greek (1453-)
Description:: Multilingual (EN, EL, FR); multimodal (Video, Text); parallel (EN, EL, FR subtitles); comparable (transcripts, subtitles); 120 hours
Rights:: Not specified

73. NameTag

Creator:: Straka, Milan and Straková, Jana
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and tool
Subject:: named entity recognizer
Language:: English
Description:: NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, organizations, etc. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
Rights:: Not specified

74. NameTag service description

Creator:: Straková, Jana and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: service and toolService
Subject:: named entity recognition, NameTag, and WeblichtXML
Language:: Czech, German, English, Spanish, and Dutch
Description:: Metadata description of NameTag (http://hdl.handle.net/11234/1-3633, https://lindat.mff.cuni.cz/services/nametag/) provided for WebLicht.
Rights:: Not specified

75. Namur Corpus

Publisher:: Katholieke Universiteit Leuven Campus Kortrijk
Type:: corpus
Language:: Dutch, English, and French
Description:: Trilingual parallel corpus, with Dutch as first language. 2M words, aligned at paragraph level. It includes fiction and non-fiction texts.
Rights:: Not specified

76. Newcastle electronic corpus of Tyneside English (NECTE)

Publisher:: Newcastle University
Format:: application/tei+xml
Type:: corpus
Language:: English
Description:: A corpus of dialect speech from Tyneside in North-East England. Includes digitized audio, standard orthographic transcription, phonetic transcription, and part-of-speech tagging.
Rights:: Not specified

77. Oxford Text Archive

Type:: corpus
Language:: English
Description:: Electronic texts, corpora, lexicons, and other resources.
Rights:: Not specified

78. Project Gutenberg

Type:: corpus
Language:: Danish, Dutch, English, Finnish, French, German, Italian, Latin, Portuguese, Russian, Spanish, Swedish, and Telugu
Description:: Free electronic books available for download or online browsing; German-language literature makes up only part of the available e-books.
Rights:: Not specified

79. Romanian-English dictionary

Type:: lexicalConceptualResource
Language:: English and Romanian
Description:: 38,000 entries, XML
Rights:: Not specified

80. RuN (Russian meets Norwegian)

Publisher:: Department of Literature, Area Studies and European Languages, University of Oslo and Department of Linguistics and Nordic Studies, University of Oslo
Type:: corpus
Language:: English, Norwegian, and Russian
Description:: The RuN corpus is a parallel corpus consisting of Norwegian, Russian and English texts. The texts are aligned at the sentence level and have been tagged for grammatical information at the word level.
Rights:: Not specified

81. Scottish corpus of texts and speech (SCOTS)

Publisher:: University of Glasgow
Type:: corpus
Language:: English
Description:: Written and spoken (20%) texts for the languages of Scotland. Approx. 4 million words. Orthographic transcriptions are synchronised with the source audio or video.
Rights:: Not specified

82. SenTube

Publisher:: Machine Learning and NLP group at Trento
Type:: corpus
Subject:: sentiment analysis
Language:: English and Italian
Description:: Sentiment analysis of Youtube videos with joint models of text and speech
Rights:: Not specified

83. Speech, Thought and Writing Presentation Corpus

Publisher:: Lancaster University
Format:: text/plain
Type:: corpus
Language:: English
Description:: A corpus of approximately 260,000 words of modern British narrative texts representing three text types (fiction, newspapers, biography) with detailed annotation for all forms of speech, thought and writing presentation that occur in the corpus. Available via OTA.
Rights:: Not specified

84. SpeechDat-Car databases

Type:: corpus
Language:: Danish, Dutch, English, Finnish, French, German, Modern Greek (1453-), Italian, and Spanish
Description:: 9 speech databases for training and testing multilingual speech recognition applications in the car environment. Contains parallel 4-channel in-car recordings and a GSM channel. Contains interesting phonetically rich material. All orthographically transcribed. Speaker information included for gender, age, accent. Includes a pronunciation lexicon.
Rights:: Not specified

85. Speecon databases

Type:: corpus
Language:: Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, Chinese, Hebrew, Japanese, Korean, and Thai
Description:: 28 speech databases containing broadband recordings from 550 adults and 50 children per language. Contains interesting phonetically rich material. All orthographically transcribed. Speaker information included for gender, age, accent. Includes a pronunciation lexicon.
Rights:: Not specified

86. Subtitle Word Frequencies

Publisher:: Center for Reading Research, Ghent University
Type:: lexicalConceptualResource
Language:: Chinese, Dutch, English, German, Modern Greek (1453-), and Spanish
Rights:: Not specified

87. SVMTool

Publisher:: Centro de Tecnologías y Aplicaciones del Lenguaje y del Habla (TALP)
Type:: toolService
Language:: Catalan, English, and Spanish
Description:: Generator of sequential taggers based on Support Vector Machines.
Rights:: Not specified

88. SynSemClass Search Tool

Creator:: Petliak, Nataliia, Hajič, Jan, Urešová, Zdeňka, and Fučíková, Eva
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: ontology, lexical semantics, search tool, and language resource
Language:: English, Czech, and German
Description:: The SynSemClass Search Tool provides a web search tool for the SynSemClass 5.0 ontology. It includes several search options and criteria for building complex queries. The search results are rendered in a clear and user-friendly interactive representation.
Rights:: Not specified

89. TeLeMaCo

Publisher:: Universität des Saarlandes
Type:: toolService
Subject:: documentation
Language:: Catalan, Dutch, English, French, German, and Italian
Description:: A collection of pointers to teaching and learning materials on linguistics and linguistic tools, including quick starts, how-tos, technical documentation, short teaching modules (2h), and full courses. This resource is collaboratively built by its users.
Rights:: Not specified

90. Termoteca

Publisher:: TALG Research Group (University of Vigo)
Type:: lexicalConceptualResource
Language:: English, French, Galician, and Spanish
Description:: Galician terminology databank, 6,000 terms
Rights:: Not specified

91. The Internet Language Reference Book

Creator:: Šmerk, Pavel, Pravdová, Markéta, Beneš, Martin, Černá, Anna, Hlaváčková, Dana, Chromý, Jan, Konečná, Hana, Kopecký, Jakub, Mžourková, Hana, Pala, Karel, Prokšová, Hana, Prošek, Martin, Smejkalová, Kamila, Svobodová, Ivana, and Uhlířová, Ludmila
Publisher:: Institute of Czech Language, Czech Academy of Sciences and Masaryk University, NLP Centre
Type:: toolService and service
Subject:: literature
Language:: Czech and English
Description:: The ILRB was created by two cooperating teams: the team of the Institute of Czech Language, Czech Academy of Sciences, and the team of the NLP Centre at the Faculty of Informatics, Masaryk University (2004-2008). The tool consists of two sections: a wordlist section and a reference (explanatory) section. Comments and remarks are welcome and should be sent to poradna@ujc.cas.cz. 1. Wordlist section: It contains more than 60,000 dictionary entries and is based on the glossary of the School Rules of Czech Orthography, the Dictionary of the Literary Czech, and selected entries from the New Dictionary of Words of Foreign Origin and the Dictionary of Neologisms. The entries typically include the information that users ask about most frequently. Inflectional forms of individual words are also offered as tables, generated by the morphological analyzer ajka created at the Faculty of Informatics, MU. The wordlist section is linked to the reference section through hypertext links. 2. Reference section: It explains linguistic phenomena described in the Rules of Czech Orthography and in contemporary Czech grammars, focusing on those that users of the Linguistic Advisory Line at the Institute of Czech Language ask about repeatedly. The explanations address typical spelling problems and include appropriate recommendations. The ILRB is regularly updated and extended; new expressions are added and existing entries are made more precise. and Academy of Sciences of the Czech Republic in project 1ET200610406 and Ministry of Education, Youth and Sports in projects LM2010013, LC536 and 2C06009.
Rights:: Not specified

92. The National Certificates corpus

Publisher:: Centre for Applied Language Studies, University of Jyväskylä
Type:: corpus
Language:: English, Finnish, French, German, Italian, Russian, Spanish, and Swedish
Description:: The NC test results, background information, speaking and writing performances in 9 foreign / second languages. A web-based database (HTML files).
Rights:: Not specified

93. Tilde English-Latvian SMT system

Publisher:: Tilde
Type:: toolService
Language:: English
Description:: English-Latvian factored SMT system trained on different parallel texts
Rights:: Not specified

94. TreeTagger

Publisher:: University of Stuttgart
Type:: toolService
Subject:: POS tagger
Language:: Bulgarian, Dutch, English, French, German, Modern Greek (1453-), Italian, Portuguese, Russian, Spanish, and Swahili (macrolanguage)
Description:: A part-of-speech tagger and lemmatizer for several languages.
Rights:: Not specified

95. Treex::Web

Creator:: Sedlák, Michal
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and service
Subject:: Treex, Perl, REST, web service, and machine translation
Language:: English and Czech
Description:: Treex::Web is a web frontend for running Treex applications from your browser. Treex (formerly TectoMT) is a highly modular NLP framework implemented in the Perl programming language. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project.
Rights:: Not specified

96. Typological Database System

Publisher:: Max Planck Institute for Psycholinguistics, University of Utrecht/Netherlands Graduate School of Linguistics, Data Archiving and Networked Services, and Meertens Institute KNAW The Netherlands
Type:: toolService
Subject:: typological database
Language:: English
Description:: The Typological Database System (TDS) is a web-based service that provides integrated access to a collection of independently created typological databases. It was developed with support from NWO grant 380-30-004 / INV-03-12 and from participating universities, and provides continued availability and extended documentation for its component databases, through a uniform structure and search interface. Web technologies evolve rapidly, and the system had begun to show its age even before the end of the project in 2009, motivating migration of the data collection to an archival platform. Through its Project Call 1, CLARIN-NL granted funding for migrating the resource to a durable, archival environment and converting it to a true web service architecture.
Rights:: Not specified

97. Vocabulario multilingüe de economía

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: lexicalConceptualResource
Subject:: terminology database
Language:: Basque, Catalan, English, Galician, and Spanish
Description:: Multilingual terminological resource containing 20,941 terms from the Economics, Finance and Banking domains.
Rights:: Not specified

98. Wikicorpus

Publisher:: Centro de Tecnologías y Aplicaciones del Lenguaje y del Habla (TALP)
Type:: corpus
Subject:: trilingual corpus
Language:: Catalan, English, and Spanish
Description:: Trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words.
Rights:: Not specified

99. Wmatrix

Publisher:: Lancaster University
Type:: toolService
Language:: English
Description:: Wmatrix is a corpus comparison and annotation tool. It is web based and incorporates the CLAWS POS tagger and the USAS semantic tagger for English. It also generates frequency lists, concordances, key words and key semantic domains by comparative frequency profiling.
Rights:: Not specified

100. Wortschatz

Publisher:: University of Leipzig
Type:: corpus
Language:: Afrikaans, Albanian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, German, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Malay (macrolanguage), Norwegian, Occitan (post 1500), Romanian, Russian, Slovak, Slovenian, Spanish, Sundanese, Swedish, Tagalog, Turkish, Vietnamese, and Welsh
Description:: Collected from newspaper texts, webcrawling, etc.: words (+frequency), cooccurrences (+graph), left/right neighbours, example sentences
Rights:: Not specified
