Harvested from: LINDAT/CLARIAH-CZ repository / Rights: https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC

1. Amharic Web Corpus

Creator:: Suchomel, Vít and Rychlý, Pavel
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Amharic, text corpus, Web corpus, under-resourced language, corpus annotation, and morphological tagger
Language:: Amharic
Description:: Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015 and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on the Amharic WIC corpus.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

2. Czech Web Corpus 2017 (csTenTen17)

Creator:: Suchomel, Vít
Publisher:: Masaryk University, NLP Centre and Lexical Computing CZ s.r.o.
Type:: text and corpus
Subject:: Web corpus
Language:: Czech
Description:: The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top-level domain ".cz". The data was crawled by the web crawler SpiderLing (https://corpus.tools/wiki/SpiderLing). The data was cleaned by removing boilerplate (using https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (using https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language. The corpus was POS annotated by the morphological analyser Majka using this POS tagset: https://www.sketchengine.eu/tagset-reference-for-czech/. Text sources: General web, Wikipedia. Time span of crawling: May, October and November 2017, October and November 2016, October and November 2015. The Czech Wikipedia part was downloaded in November 2017. Data format: Plain text, vertical (one token per line), gzip compressed. The vertical contains the following structures: documents (<doc/>, usually corresponding to web pages), paragraphs (<p/>), sentences (<s/>) and word join markers (<g/>, a "glue" tag indicating that there was no space between the surrounding tokens in the original text). Document metadata: src (the source of the data), title (the title of the web page), url (the URL of the document), crawl_date (the date of downloading the document). Paragraph metadata: heading ("1" if the paragraph is a heading, usually <h1> to <h6> elements in the original HTML data). Block elements in the case of an HTML source, or double blank lines in the case of other source formats, were used as paragraph separators. An internal heuristic tool was used to mark sentence breaks. The tab-separated positional attributes are: word form, morphological annotation, lem-POS (the base form of the word, i.e. the lemma, with a part-of-speech suffix) and gender-respecting lemma (nouns and adjectives only). Please cite the following paper when using the corpus for your research: Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018. (https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119)
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA
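
The vertical format described in the csTenTen17 record above can be read with a few lines of Python. The following is a minimal sketch, not the corpus authors' tooling: the column names in COLUMNS, the function names, and all sample values (including the placeholder "TAG" and the lem-POS suffixes) are assumptions made for illustration. It assumes structure tags appear alone on their lines and that <g/> precedes the token it attaches to the previous one.

```python
import gzip
import re

# Hypothetical names for the four tab-separated positional attributes
# listed in the description (word form, morphological annotation,
# lem-POS, gender-respecting lemma).
COLUMNS = ("word", "tag", "lempos", "gender_lemma")

# A structure line: <doc ...>, </doc>, <p ...>, </p>, <s>, </s> or <g/>.
STRUCT_RE = re.compile(r"^<(/?)(doc|p|s|g)(\s[^>]*)?(/?)>$")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def open_vertical(path):
    """Open a gzip-compressed vertical file as UTF-8 text."""
    return gzip.open(path, "rt", encoding="utf-8")


def read_sentences(lines):
    """Yield (doc_attrs, tokens) for each sentence.

    tokens is a list of (columns, glued) pairs; glued is True when a
    <g/> marker stood directly before the token.
    """
    doc_attrs = {}
    tokens = None
    glued = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        match = STRUCT_RE.match(line)
        if match:
            closing, name = match.group(1), match.group(2)
            if name == "doc" and not closing:
                doc_attrs = dict(ATTR_RE.findall(line))
            elif name == "s" and not closing:
                tokens, glued = [], False
            elif name == "s" and closing:
                yield doc_attrs, tokens
                tokens = None
            elif name == "g":
                glued = True
            continue
        if tokens is not None:
            # The fourth column may be absent; zip simply stops early.
            columns = dict(zip(COLUMNS, line.split("\t")))
            tokens.append((columns, glued))
            glued = False


def detokenize(tokens):
    """Rebuild the sentence text, omitting the space where <g/> occurred."""
    parts = []
    for columns, glued in tokens:
        if parts and not glued:
            parts.append(" ")
        parts.append(columns["word"])
    return "".join(parts)


# Made-up sample in the described layout; the values are placeholders.
SAMPLE = (
    '<doc src="web" title="Sample" url="http://example.cz/" crawl_date="2017-05-01">\n'
    '<p heading="1">\n'
    "<s>\n"
    "Ahoj\tTAG\tahoj-x\n"
    "<g/>\n"
    ",\tTAG\t,-x\n"
    "světe\tTAG\tsvět-n\tsvět\n"
    "<g/>\n"
    "!\tTAG\t!-x\n"
    "</s>\n"
    "</p>\n"
    "</doc>\n"
)

for attrs, sentence in read_sentences(SAMPLE.splitlines()):
    print(attrs["url"], detokenize(sentence))
```

For a real file, the same generator can be fed from open_vertical("path/to/file.vert.gz"), where the path is a placeholder. Streaming line by line keeps memory use flat, which matters for a corpus of this size.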

3. enTenTen

Creator:: (:unav) Unknown author
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: English large corpus
Language:: English
Description:: Very large English web corpus enTenTen, comprising 3,268,798,627 tokens. Lexical Computing Ltd.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

4. Indonesian web corpus

Creator:: Medveď, Marek and Suchomel, Vít
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Web corpus
Language:: Indonesian
Description:: Indonesian web corpus crawled in 2010. Encoded in UTF-8, cleaned, deduplicated, tagged by MorphInd.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

5. Indonesian web corpus (idWac)

Creator:: Medveď, Marek and Suchomel, Vít
Publisher:: Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Type:: text and corpus
Subject:: corpus, lemmatization, and PoS tagging
Language:: Indonesian
Description:: Indonesian text corpus from the web. Crawled by SpiderLing in 2017. Filtered by jusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized by MorphInd (http://septinalarasati.com/morphind/).
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

6. Oromo web corpus

7. Somali Web Corpus

8. Tigrinya Web Corpus
