Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

Start Over Harvested from LINDAT/CLARIAH-CZ repository Date 2010 to 2011

1. A Gold Standard Word Alignment for English-Swedish

Creator:: Ahrenberg, Lars and Holmqvist, Maria
Publisher:: Linköping University
Type:: text, wordList, and lexicalConceptualResource
Subject:: word alignment
Language:: Swedish and English
Description:: A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word aligned sentences pairs from English and Swedish versions of Europarl v. 2. The data can be found here: https://www.ida.liu.se/labs/nlplab/ges/
Rights:: Not specified

2. Chared

Creator:: Pomikálek, Jan
Publisher:: Masaryk University, NLP Centre
Type:: toolService and tool
Subject:: character encoding, character encoding detection, charset, and unicode
Language:: English
Description:: Chared is a software tool which can detect character encoding of a text document provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57 --- covering all major languages). Furthermore, it provides a training script to learn models for additional languages using a set of user supplied sample html pages in the given language. The detection algorithm is based on determining similarity of byte trigrams vectors. In general, chared should be more accurate than other character encoding detection tools with no language constraints. This is an important advantage allowing precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages consisting of 70 billions tokens altogether. Chared is an open source software, licensed under New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this piece of software was published in POMIKÁLEK, Jan a Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5. vyd. Brno, Czech Republic: Tribun EU, 2011. od s. 125-129, 5 s. ISBN 978-80-263-0077-9. and PRESEMT, Lexical Computing Ltd
Rights:: BSD 3-Clause "New" or "Revised" license, http://opensource.org/licenses/BSD-3-Clause, and PUB

3. Copenhagen Dependency Treebanks versions 1-3

Publisher:: Copenhagen Business School
Format:: application/octet-stream
Type:: corpus
Subject:: parallel treebank, POS annotation, discourse annotation, morphological annotation, syntactic annotation, and semantic annotation
Language:: Danish, English, German, Italian, and Spanish
Description:: Parallel treebanks with annotation of syntax, discourse, coreference, morphology, and semantics. Version 3 also includes the Danish Dependency Treebank (version 1) and the Danish-English Parallel Dependency Treebank (version 2).
Rights:: GNU General Public License

4. Corpus of contemporary blogs

Creator:: Grác, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: corpus, blogs, annotation, annotators, sentences, and machine learning
Language:: Czech
Description:: In NLP Centre, dividing text into sentences is currently done with a tool which uses rule-based system. In order to make enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz modnipeklo.cz bloc.cz aleneprokopova.blogspot.com blog.aktualne.cz fuchsova.blog.onaidnes.cz havlik.blog.idnes.cz blog.aktualne.centrum.cz klusak.blogspot.cz myego.cz/welldone
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

5. Croatian-English Parallel Corpus

Publisher:: University of Zagreb, Faculty of Humanities and Social Sciences
Type:: corpus
Language:: Croatian and English
Description:: written; domain-specific (newspaper); synchronic; bilingual; parallel; unidirectional; XML; S-alignment
Rights:: Not specified

6. Dictionaries of Luxembourgish

Publisher:: University of Luxembourg
Format:: application/octet-stream
Type:: corpus
Language:: Luxembourgish
Description:: Online database of three older dictionaries of Luxembourgish from 1849, 1905, and 1950
Rights:: Not specified

7. Dictionary of the standard Latvian language

Publisher:: Institute of Mathematics and Computer Science, University of Latvia
Format:: text/html
Type:: lexicalConceptualResource
Subject:: dictionary
Language:: Latvian
Description:: ~64 000 entries
Rights:: Not specified

8. EDBL: Lexical Data Base for Basque (Euskararen Datu-base Lexikala)

Publisher:: University of the Basque Country
Format:: application/xml
Type:: lexicalConceptualResource
Language:: Basque
Description:: EDBL (Lexical DataBase for Basque) is the lexical basis needed for the automatic treatment of Basque. It is made up of about 120.000 entries divided into dictionary entries (the same you can find in a conventional dictionay), verb forms and dependent morphemes, all of them with their respective morphological information.
Rights:: Only for research and demonstrative purposes

9. English-Urdu Religious Parallel Corpus

Creator:: Jawaid, Bushra and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, religious text, and machine translation
Language:: English and Urdu
Description:: English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu language with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of crawled data include but are not limited to the following: 1- Manually corrected sentence alignment of the corpora. 2- Our data split (training-development-test) so that our published experiments can be reproduced. 3- Tokenization (optional, but needed to reproduce our experiments). 4- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

10. Italian Content Words

Creator:: Grella, Matteo
Publisher:: Matteo Grella
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: morphological dictionary
Language:: Italian
Description:: This resource is an Italian morphological dictionary for content words, encoded in a JSON Lines format text file. It contains correspondences between surface form and lexical forms of words followed by grammatical features. The surface word forms have been generated algorithmically by using stable phonological and morphological rules of the Italian language. Particular attention has been given to the generation of verbs for which rules have been extracted from the famous A.L e G. Lepschy, La lingua italiana. The dictionary with its remarkable coverage is particularly useful used together with the Italian Function Words (http://hdl.handle.net/11372/LRT-2288) for tasks such as POS-Tagging or Syntactic Parsing.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

1. A Gold Standard Word Alignment for English-Swedish

2. Chared

3. Copenhagen Dependency Treebanks versions 1-3

4. Corpus of contemporary blogs

5. Croatian-English Parallel Corpus

6. Dictionaries of Luxembourgish

7. Dictionary of the standard Latvian language

8. EDBL: Lexical Data Base for Basque (Euskararen Datu-base Lexikala)

9. English-Urdu Religious Parallel Corpus

10. Italian Content Words

Limit your search

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Coverage

Creator

Show values starting with

Format

Language

Show values starting with

Publisher

Show values starting with

Rights

Show values starting with

Subject

Show values starting with

Type

Date

Original context has metadata only

Harvested from