Publisher: Masaryk University, NLP Centre / Rights: PUB / Type: corpus

1. Amharic WIC Corpus

Creator:: Rychlý, Pavel
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: text corpora, Ethiopian languages, web corpora, under-resourced languages, and Amharic
Language:: Amharic
Description:: A substantially cleaned version of the existing morphologically annotated WIC Corpus.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

2. BushBank

Creator:: Grác, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: interannotator agreement, corpus, chunks, phrases, and clauses
Language:: Czech
Description:: Czech corpus annotated for NP and clause chunks by 3 to 11 annotators, with an average inter-annotator agreement of 88%. It consists of 10,000 sentences.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

3. Corpus of contemporary blogs

Creator:: Grác, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: corpus, blogs, annotation, annotators, sentences, and machine learning
Language:: Czech
Description:: At the NLP Centre, text is currently divided into sentences by a rule-based tool. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, and myego.cz/welldone.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

4. Czech Grammar Agreement Dataset for Evaluation of Language Models

Creator:: Baisa, Vít
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: agreement, past tense verb suffix, language model, and training data
Language:: Czech
Description:: AGREE is a dataset and task for evaluating language models on grammatical agreement in Czech. The dataset consists of sentences in which the suffixes of past-tense verbs are marked. The task is to choose the correct verb suffix, which depends on the gender, number and animacy of the subject. It is challenging for language models because 1) Czech is morphologically rich, 2) it has relatively free word order, 3) it has a high out-of-vocabulary (OOV) ratio, 4) the predicate and subject can be far from each other, 5) subjects can be unexpressed and 6) various semantic rules may apply. The task provides a straightforward and easily reproducible way of evaluating language models on a morphologically rich language.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
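
The suffix-selection task described for AGREE can be sketched as follows. This is only an illustrative sketch, not the dataset's own tooling: the `*` slot marker, the function names and the toy scorer are assumptions made for the example, and the sample sentences are invented rather than taken from the dataset. A real evaluation would replace `toy_score` with a language model's sentence score.

```python
from typing import Callable, Sequence, Tuple

# Czech past-tense endings that follow the participle's -l:
# dělal, dělala, dělalo, dělali, dělaly.
SUFFIXES: Tuple[str, ...] = ("", "a", "o", "i", "y")


def choose_suffix(template: str,
                  score: Callable[[str], float],
                  suffixes: Sequence[str] = SUFFIXES) -> str:
    """Return the suffix whose filled-in sentence gets the highest score.

    `template` marks the suffix slot with '*' (a hypothetical convention).
    """
    return max(suffixes, key=lambda s: score(template.replace("*", s)))


def accuracy(items: Sequence[Tuple[str, str]],
             score: Callable[[str], float]) -> float:
    """Share of (template, gold_suffix) pairs where the chosen suffix is correct."""
    correct = sum(choose_suffix(t, score) == gold for t, gold in items)
    return correct / len(items)


# Toy stand-in for a language model: it only "knows" two invented sentences.
SEEN = {"Žena dělala.", "Muž dělal."}


def toy_score(sentence: str) -> float:
    return 1.0 if sentence in SEEN else 0.0


items = [("Žena dělal*.", "a"), ("Muž dělal*.", "")]
result = accuracy(items, toy_score)
```

Scoring every candidate sentence and taking the best one keeps the evaluation independent of the model's architecture: any model that can assign a score to a sentence can be plugged in.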

5. czes

Creator:: (:unav) Unknown author
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Czech corpus large
Language:: Czech
Description:: First version of the very large Czech corpus Czes, created with a new set of tools. It comprises 465,102,710 tokens. Also credited: Lexical Computing Ltd.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

6. czTenTen12 v9 subcorpus of problematic phenomena

Creator:: Pelikánová, Zuzana and Nevěřilová, Zuzana
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Non-standard language, Interlingual homographs, and Text corpus
Language:: Czech
Description:: czTenTen12 v9 subcorpus containing problematic features (interlingual homographs, foreign proper names, named entities)
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

7. English-Czech Corpus from Wikipedia

Creator:: Štromajerová, Adéla, Baisa, Vít, and Blahuš, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Wikipedia
Language:: English and Czech
Description:: Sentence-parallel corpus built from the English and Czech Wikipedias, based on articles translated from English into Czech. The work is described in the paper: ŠTROMAJEROVÁ, Adéla, Vít BAISA and Marek BLAHUŠ. Between Comparable and Parallel: English-Czech Corpus from Wikipedia. In RASLAN 2016 Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016. pp. 3-8, 6 pp. ISBN 978-80-263-1095-2.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

8. skTenTen

Creator:: (:unav) Unknown author
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Slovak large corpus
Language:: Slovak
Description:: Large Slovak web corpus skTenTen, comprising 876,003,720 tokens. Also credited: Lexical Computing Ltd.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

9. SQAD

Creator:: Medveď, Marek and Horák, Aleš
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: question answering, Simple Question Answering Database, and SQAD
Language:: Czech
Description:: The SQAD database consists of 3301 records obtained from Czech Wikipedia articles. Each record contains: the original sentence(s) from Wikipedia; a question that is directly answered in the text; the expected answer to the question as it appears in the original text; the URL of the Wikipedia page from which the original text was extracted; and the name of the author of the SQAD record.
Rights:: GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB
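
The five-part record structure listed for SQAD could be modelled roughly as below. This is a hedged sketch for orientation only: the class name, field names and helper function are hypothetical, the actual on-disk format is not specified in this listing, and the sample record is invented rather than taken from the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqadRecord:
    """Hypothetical in-memory view of one SQAD record (field names are illustrative)."""
    text: str      # original sentence(s) from Wikipedia
    question: str  # question that is directly answered in the text
    answer: str    # expected answer as it appears in the original text
    url: str       # URL of the Wikipedia page the text was extracted from
    author: str    # name of the author of this SQAD record


def answer_in_text(record: SqadRecord) -> bool:
    """Check the stated property that the answer appears verbatim in the text."""
    return record.answer in record.text


# Invented example record, not part of SQAD.
example = SqadRecord(
    text="Brno je druhé největší město v České republice.",
    question="Které město je druhé největší v České republice?",
    answer="Brno",
    url="https://cs.wikipedia.org/wiki/Brno",
    author="(example annotator)",
)
```

A check such as `answer_in_text(example)` is one simple way to validate records on load, since the description states that the expected answer is given as it appears in the original text.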

10. sqad 3.0

Creator:: Medveď, Marek and Horák, Aleš
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Simple Question Answering Database, Czech, and question answering
Language:: Czech
Description:: Simple Question Answering Database version 3 (SQAD v3), created from Czech Wikipedia. The new version consists of 13477 records. Each SQAD record consists of multiple files: question, answer extraction, answer selection, URL, question metadata and, in some cases, answer context.
Rights:: GNU Library or "Lesser" General Public License 3.0 (LGPL-3.0), http://opensource.org/licenses/LGPL-3.0, and PUB

