A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word aligned sentences pairs from English and Swedish versions of Europarl v. 2.
The data can be found here: https://www.ida.liu.se/labs/nlplab/ges/
A vocabulary resulting from the cooperation of the groups of REALITER network that collects the basic terminology mostly used in texts about Genomics. It contains equivalents in English, Peninsular and Latinamerican Spanish, French, Italian, Galician, Portuguese and Catalan.
A collection of parallel corpora: English-Lithuanian (2m words), Lithuanian-English (0,06m words), Czech-Lithuanian (0,8m words), Lithuanian-Czech (0,02m words). All the corpora are online-searcheable via one interface at http://donelaitis.vdu.lt/main_en.php?id=4&nr=1_2. The corpus is still being updated with new texts.
Digital copies of historical botanic papers from the Missouri Botanical Garden Library; Bilddigitalisate von historischen botanischen Schriften; deutschsprachige Texte stellen nur einen Teilbereich dar
bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source-texts and translations. Searchable via the IMS Corpus Query Processor and the DISPARA interface
Oral corpus containing 166 narratives in English elicited by means of Labovian techniques. Participants from the UK (England, Wales, Scotland), Ireland, USA, Australia and South Africa.
domain specific corpus (Law, Economy, Computing, Medicine and Environment as well as a contrastive corpus from the press); EN 3.3 M tokens, SP 33 M tokens, CAT 19 M tokens; EAGLEs pos tagset
Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 interactive visualizations under a user-friendly interface. Routine tasks such as text acquisition, cleaning or tagging are completely automated. The simple interface supports the use in university teaching and leads users/students to fast and substantial results. The CorpusExplorer is open for many standards (XML, CSV, JSON, R, etc.) and also offers its own software development kit (SDK).
Source code available at https://github.com/notesjor/corpusexplorer2.0
A parsed corpus of spoken English. Ca 400,000 words from ICE-GB (early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s). The orthographic transcriptions have been normalised and annotated.
Historical dictionary of the Scottish language as written and spoken by lowland Scots in Scotland and Ulster from the 12th century onward. Over eighty thousand full-word entries.
Speech corpus comprising 4608 spoken sentences recorded for speech timing research. The complete archive, available for downloading, includes a structured list of the sentences, the speech recordings and the label files, plus full documentation.
Open source language analysis tool suite: tokenizer, stemmer/lemmatizer, named entity recognizer, chunker/segmenter, morphosyntactic tagger, syntactic tagger, corpus processer, morphological tagger, semantic tagger, analyzer, Word Sense Disambiguator.
Collection of orthographically transcribed audio recorded speech, mainly from East Anglia and the South-West, with a minor collection from Lancashire. The recordings were made in the 1970s and the 1980s by Finnish postgraduates.
One million words of written and spoken English from Great Britain. Transcriptions aligned with digitised speech recordings. POS-tagged and parsed. Part of the International Corpus of English project. Custom-made search software: ICE-CUP
1 million words spoken and written English from UK. POS-tagged and parsed. Digitised speech recordings aligned w text. Part of the International Corpus of English (ICE).
36 hours of speech recordings of nine urban varieties of UK English, collected among 16-year-olds in secondary schools. Part of the corpus has been prosodically transcribed.
JIRS is a Passage Retrieval system specially suited for Question Answering. It could be adapted to others languages very easily. ask (Written Language): Information Retrieval Applications Question/Answering Environment: OS-independent Access: GPLv3
Parallel corpus, 3,297,283 words.
The idea was to create a small parallel corpus which would enable to work with entire texts in translation analysis rather then short extracts. At the same time it aimed at acquiring experience that could be used in creating a larger parallel corpus of English and Czech in the future.
Although the main part of work has been completed -- and the aims of the KACENKA grant met -- we keep improving and enlarging KACENKA gradually. Currently, it has the size of 3,297,283 words (out of which, 1,689,513 have been acquired by means of scanning).
Most of the English texts for KACENKA have been retrieved from the Internet resources. The rest -- and nearly all the Czech texts -- had to be scanned with the use of an OCR programme.
KACENKA is stored on a single CD-ROM; its use is limited by copyright restrictions.
MEBA is a lexical aligner, implemented in C#, based on an iterative algorithm that uses pre-processing steps: sentence alignment ([[http://www.clarin.eu/tools/sal-sentence-aligner|SAL]]), tokenization, POS-tagging and lemmatization (through [[http://www.clarin.eu/tools/ttl-tokenizing-tagging-and-lemmatizing-free-running-texts|TTL]], sentence chunking. Similar to YAWA aligner, MEBA generates the links step by step, beginning with the most probable (anchor links). The links to be
added at any later step are supported or restricted by the links created in the previous iterations. The aligner has different weights and different significance thresholds on each feature and iteration. Each of the iterations can be configured to align different categories of tokens (named entities, dates and numbers, content words, functional words, punctuation) in decreasing order of statistical evidence.
MEBA has an individual F-measure of 81.71% and it is currently integrated in the platform [[http://www.clarin.eu/tools/cowal-combined-word-aligner|COWAL]].
More detailed descriptions are available in [[http://www.racai.ro/~tufis/papers|the following papers]]:
-- Dan Tufiş (2007). Exploiting Aligned Parallel Corpora in Multilingual Studies and Applications. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Intercultural Collaboration. First International Workshop (IWIC 2007), volume 4568 of Lecture Notes in Computer Science, pp. 103-117. Springer-Verlag, August 2007. ISBN 978-3-540-73999-9.
-- -- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2006). Improved Lexical Alignment by Combining Multiple Reified Alignments. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Proceedings of the 11th Conference EACL2006, pp. 153-160, Trento, Italy, April 2006. Association for Computational Linguistics. ISBN 1-9324-32-61-2.
-- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2005). Combined Aligners. In Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, pp. 107-110, Ann Arbor, USA, June 2005. Association for Computational Linguistics. ISBN 978-973-703-208-9.
MBSP is a set of linguistic tools based on the TiMBL and MBT memory based learning applications developed at CNTS and ILK. It provides tools for Part of Speech tagging, Chunking, Lemmatizing, Relation Finding, Named Entity Recognition, and (for medical language) Semantic tagging.
MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. In the Czech language, MorphoDiTa achieves state-of-the-art results with a throughput around 10-200K words per second. MorphoDiTa is a free software under LGPL license and the linguistic models are free for non-commercial use and distributed under CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
Corpus of the ESF Foreign Language Speakers project; almost perfect structurefor IEI; completely metadata described; lots of annotated audio recordings containing multimodal interaction;
Multilingual lexical database that follows the model proposed by the EuroWordNet project. The MCR integrates into the same EuroWordNet framework wordnets from five different languages (together with four English WordNet versions). It also integrates WordNet Domains and new versions of the Base Concepts and Top Concept Ontology. Overall, it contains 1,642,389 semantic relations between synsets, most of them acquired by automatic means. Information contained: semantics, synonyms, antonyms, definition, equivalents, example of use, morphology.
NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, organizations, etc. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. In the Czech language, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is a free software under LGPL license and the linguistic models are free for non-commercial use and distributed under CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
A corpus of dialect speech from Tyneside in North-East England. digitized audio, standard orthographic transcription, phonetic transcription, and part-of-speech tagged
Possibility to download or to browse free electronic books; Angebot: Download von und Online-Zugang zu frei verfügbaren E-Books; deutschsprachige Literatur stellt nur einen Teilbereich der verfügbaren E-Books dar
The RuN corpus is a parallel corpus consisting of Norwegian, Russian and English texts. The texts are aligned at the sentence level and have been tagged for grammatical information at the word level.
A corpus of approximately 260,000 words of modern British narrative texts representing three text types (fiction, newpapers, biography) with detailed annotation for all forms of speech, thought and writing presentation which occur in the corpus. Available via OTA.
9 speech databases for training and testing multilingual speech recognition applications in the car environment. Contains parallel 4 channel in-car recordings and a GSM channel. Contains interesting phonetically rich material. All orthographically transcribed. Speaker information included for gender, age, accent. Including pronunciation lexicon.
28 speech databases containing broadband recordings from 550 adults and 50 children per language. Contains interesting phonetically rich material. All orthographically transcribed. Speaker information included for gender, age, accent. Including pronunciation lexicon.
The SynSemClass Search Tool provides a web search tool for the SynSemClass 5.0 ontology. It includes several search options and criteria for building complex queries. The search results are rendered in a clear and user-friendly interactive representation.
A collection of pointers to teaching and learning materials on linguistics and linguistic tools, including quick starts, how-tos, technical documentation, short teaching modules (2h), and full courses. This resource is collaboratively built by its users.
The ILRB has been created by two cooperating teams - by the team of the Institute of Czech Language, Czech Academy of Sciences and the team of the NLP Centre at the Faculty of Informatics, Masaryk University (2004-2008).
The tool consists of two sections: wordlist and reference (explanatory) one. Comments and remarks are welcome and should be send to the address poradna@ujc.cas.cz.
1. Wordlist section
It contains more than 60 000 dictionary entries and is based on the glossary of the School Rules of Czech Orthography, the Dictionary of the Literary Czech and selected entries from the New Dictionary of Words of Foreign Origin and Dictionary of Neologisms. The entries typically include information that is asked about frequently by the users. Also inflectional forms of the particular words forms are offered in the form of tables thanks to the morphological analyzer ajka created at the Faculty of Informatics, MU. The dictionary part is linked to the explanatory one through the hypertext links.
2. Reference section
It comprises the explanations about linguistic phenomena described in the Rules of Czech Orthography and contemporary Czech grammars, frequently and repeatedly asked by the users turning to the Linguistic Advisory Line in the Institute of Czech Language. In the offered explanations some typical spelling problems are dealt with including the appropriate recommendations. The ILRB is regularly updated and completed, new expressions are added and made more precise. and Academy of Sciences of the Czech Republic in project 1ET200610406 and Ministry of Education, Youth and Sports in projects LM2010013, LC536 and 2C06009.
Treex::Web is a web frontend for running Treex applications from your browser.
Treex (formerly TectoMT) is a highly modular NLP framework implemented in Perl programming language. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project.
The Typological Database System (TDS) is a web-based service that provides integrated access to a collection of independently created typological databases. It was developed with support from NWO grant 380-30-004 / INV-03-12 and from participating universities, and provides continued availability and extended documentation for its component databases, through a uniform structure and search interface. Web technologies evolve rapidly, and the system had begun to show its age even before the end of the project in 2009, motivating migration of the data collection to an archival platform. Through its Project Call 1, CLARIN-NL granted funding for migrating the resource to a durable, archival environment and converting it to a true web service architecture.
Trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words.
Wmatrix is a corpus comparison and annotation tool. It is web based and incorporates the CLAWS POS tagger and the USAS semantic tagger for English. It also generates frequency lists, concordances, key words and key semantic domains by comparative frequency profiling.