Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

Search constraints: Harvested from LINDAT/CLARIAH-CZ repository; Date 2010

1. Croatian-English Parallel Corpus

Publisher:: University of Zagreb, Faculty of Humanities and Social Sciences
Type:: corpus
Language:: Croatian and English
Description:: written; domain-specific (newspaper); synchronic; bilingual; parallel; unidirectional; XML; S-alignment
Rights:: Not specified

2. Dictionaries of Luxembourgish

Publisher:: University of Luxembourg
Format:: application/octet-stream
Type:: corpus
Language:: Luxembourgish
Description:: Online database of three older dictionaries of Luxembourgish from 1849, 1905, and 1950
Rights:: Not specified

3. Dictionary of the standard Latvian language

Publisher:: Institute of Mathematics and Computer Science, University of Latvia
Format:: text/html
Type:: lexicalConceptualResource
Subject:: dictionary
Language:: Latvian
Description:: ~64 000 entries
Rights:: Not specified

4. EDBL: Lexical Data Base for Basque (Euskararen Datu-base Lexikala)

Publisher:: University of the Basque Country
Format:: application/xml
Type:: lexicalConceptualResource
Language:: Basque
Description:: EDBL (Lexical DataBase for Basque) is the lexical basis needed for the automatic treatment of Basque. It is made up of about 120.000 entries divided into dictionary entries (the same as you can find in a conventional dictionary), verb forms, and dependent morphemes, all of them with their respective morphological information.
Rights:: Only for research and demonstrative purposes

5. English-Urdu Religious Parallel Corpus

Creator:: Jawaid, Bushra and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, religious text, and machine translation
Language:: English and Urdu
Description:: The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include but are not limited to the following: (1) Manually corrected sentence alignment of the corpora. (2) Our data split (training-development-test), so that our published experiments can be reproduced. (3) Tokenization (optional, but needed to reproduce our experiments). (4) Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, and removal of Urdu diacritics.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

6. Living Oral History Workbench: Interviewproject Nederlandse Veteranen (IPNV)

Publisher:: The Netherlands Veteran Institute, Centre for Language and Speech Technology, Radboud University, and Data Archiving and Networked Services
Format:: text/plain
Type:: corpus
Language:: Dutch
Description:: The Netherlands Veterans Institute (VI) hosts about 250 audio interviews in which Dutch former military personnel speak about their experiences during World War II (interviews about the years 1935-1945) and decolonisation in the Dutch East Indies (1945-1950) and Dutch New Guinea (1960-1962). In the project Living Oral History Workbench these interviews have been indexed by automatic speech recognition techniques. The list of interviews and their metadata are available at the CLARIN Center; researchers may apply to VI for access to the data.
Rights:: Not specified

7. Luxogramm - Grammatisches Informationssystem zum Luxemburgischen

Publisher:: University of Luxembourg
Format:: application/octet-stream
Type:: languageDescription
Language:: Luxembourgish
Description:: Luxogramm provides grammatical information (paradigms, rules, categories) for all Luxembourgish verbs
Rights:: Not specified

8. Multiword expressions in the Prague Dependency Treebank 2.0

Creator:: Bejček, Eduard, Klyueva, Natalia, Straňák, Pavel, Šidák, Pavel, Šťastná, Eva, Vimmrová, Pavlína, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: MWE, multiword expressions, idiom, phraseme, and named entity
Language:: Czech
Description:: This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0. Supported by grant 1ET201120505 of the Academy of Sciences of the Czech Republic and grant MSM0021620838 of the Ministry of Youth, Education and Sport of the Czech Republic.
Rights:: Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB

9. Nederlandse Familienamen Databank (Dutch Database of Family Names)

Publisher:: Meertens Institute KNAW The Netherlands
Format:: application/octet-stream
Type:: toolService
Language:: Dutch
Description:: Enriched database of (mainly) Dutch family names, based on 1947 census (in progress; currently 90.000 entries from 140.000 max)
Rights:: Meertens Institute KNAW The Netherlands

10. SYN2009PUB: corpus of Czech newspapers

Creator:: Křen, Michal, Bartoň, Tomáš, Hnátková, Milena, Jelínek, Tomáš, Petkevič, Vladimír, Procházka, Pavel, and Skoumalová, Hana
Publisher:: Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary Czech newspapers and magazines, about 700 million words (700 MW) in size. It contains various titles published between 1995 and 2007. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the query interface to registered users of the CNC, with one important exception: they are shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries) whose ordering was randomized within the given document. Project MSM0021620823: Český národní korpus a korpusy dalších jazyků (Czech National Corpus and Corpora of Other Languages).
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA
