Czech Web Corpus 2017 (csTenTen17)

Name: Czech Web Corpus 2017 (csTenTen17)
License: https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC
Keywords: Web corpus

Suchomel, Vít

dc.contributor.author	Suchomel, Vít
dc.date.accessioned	2022-09-15T14:18:30Z
dc.date.available	2022-09-15T14:18:30Z
dc.date.issued	2018-12-07
dc.identifier.citation	Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018.
dc.identifier.uri	http://hdl.handle.net/11234/1-4835
dc.description	The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top-level domain ".cz". The data was crawled by the SpiderLing web crawler (https://corpus.tools/wiki/SpiderLing). It was cleaned by removing boilerplate (using jusText, https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (using Onion, https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language. The corpus was POS-annotated by the morphological analyser Majka using this POS tagset: https://www.sketchengine.eu/tagset-reference-for-czech/. Text sources: general web, Wikipedia. Time span of crawling: May, October and November 2017; October and November 2016; October and November 2015. The Czech Wikipedia part was downloaded in November 2017. Data format: plain text, vertical (one token per line), gzip compressed. The vertical contains the following structures: documents (<doc/>, usually corresponding to web pages), paragraphs (<p/>), sentences (<s/>) and word join markers (<g/>, a "glue" tag indicating that there was no space between the surrounding tokens in the original text). Document metadata: src (the source of the data), title (the title of the web page), url (the URL of the document), crawl_date (the date of downloading the document). Paragraph metadata: heading ("1" if the paragraph is a heading, usually <h1> to <h6> elements in the original HTML data). Block elements in the case of an HTML source, or double blank lines in the case of other source formats, were used as paragraph separators. An internal heuristic tool was used to mark sentence breaks. The tab-separated positional attributes are: word form, morphological annotation, lem-POS (the base form of the word, i.e. the lemma, with a part-of-speech suffix) and gender-respecting lemma (nouns and adjectives only). Please cite the following paper when using the corpus for your research: Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018. (https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119)
dc.language.iso	ces
dc.publisher	Masaryk University, NLP Centre
dc.publisher	Lexical Computing CZ s.r.o.
dc.relation.isreferencedby	https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119
dc.rights	NLP Centre Web Corpus License
dc.rights.uri	https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC
dc.source.uri	https://nlp.fi.muni.cz/projects/cstenten/
dc.subject	Web corpus
dc.title	Czech Web Corpus 2017 (csTenTen17)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	ACA
has.files	yes
branding	LINDAT / CLARIAH-CZ
contact.person	Vít Suchomel xsuchom2@fi.muni.cz Natural Language Processing Centre, Masaryk University
contact.person	Pavel Rychlý pary@fi.muni.cz Natural Language Processing Centre, Masaryk University
contact.person	Miloš Jakubíček milos.jakubicek@sketchengine.eu Lexical Computing CZ s.r.o.
sponsor	Ministry of Education, Youth and Sports of the Czech Republic LM2018101 LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities nationalFunds
size.info	12500000000 tokens
files.size	91838674037
files.count	1
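
The vertical format described in dc.description can be read as a stream, which matters for a file of this size. The following is a minimal Python sketch, not the tooling used to build the corpus. It assumes the file is UTF-8 encoded, that structure tags occupy their own lines and contain no tab characters, and that attribute values are double-quoted. The function names, the sample document, and the placeholder values `TAG`, `LEMPOS` and `DATE` are hypothetical and do not reproduce real corpus annotation.

```python
import gzip
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: structure attributes look like name="value".
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# A token is its tab-separated columns plus a flag that is True when a
# <g/> glue tag preceded it (no space before the token in the original text).
Token = Tuple[List[str], bool]


def open_vertical(path: str):
    """Open the gzip-compressed vertical as text (UTF-8 is an assumption)."""
    return gzip.open(path, "rt", encoding="utf-8")


def iter_sentences(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], List[Token]]]:
    """Yield (document attributes, tokens) for every <s>...</s> block."""
    doc_attrs: Dict[str, str] = {}
    sentence: List[Token] = []
    glue = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Heuristic: a structure line starts with "<", ends with ">" and has no tab.
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            if line.startswith("<doc"):
                doc_attrs = dict(ATTR_RE.findall(line))
            elif line == "<g/>":
                glue = True
            elif line == "</s>":
                if sentence:
                    yield doc_attrs, sentence
                sentence = []
                glue = False
            continue
        sentence.append((line.split("\t"), glue))
        glue = False


def detokenize(tokens: List[Token]) -> str:
    """Rebuild the surface text, omitting the space before glued tokens."""
    parts: List[str] = []
    for i, (cols, glued) in enumerate(tokens):
        if i > 0 and not glued:
            parts.append(" ")
        parts.append(cols[0])  # column 0 is the word form
    return "".join(parts)


# Hypothetical sample; TAG, LEMPOS and DATE are placeholders, not real values.
SAMPLE = "\n".join([
    '<doc src="web" title="Example page" url="http://example.cz/" crawl_date="DATE">',
    '<p heading="1">',
    "<s>",
    "Ahoj\tTAG\tLEMPOS",
    "<g/>",
    ",\tTAG\tLEMPOS",
    "světe\tTAG\tLEMPOS\tLEMPOS",
    "<g/>",
    "!\tTAG\tLEMPOS",
    "</s>",
    "</p>",
    "</doc>",
])

if __name__ == "__main__":
    for attrs, tokens in iter_sentences(SAMPLE.splitlines()):
        print(attrs["url"], detokenize(tokens))
```

To process the real file, pass the handle from `open_vertical("csTenTen17.vert.gz")` to `iter_sentences` instead of the sample lines. The sketch keeps every column as a string and does not unescape entities in attribute values, so it tolerates rows where the fourth column is missing.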

Files in this item

License category:

Academic Use

License: NLP Centre Web Corpus License

Name: csTenTen17.vert.gz
Size: 85.53 GB
Format: application/x-gzip
Description: data
MD5: 9f56f9de95892a1da24576ad513fe9ab
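
Because the download is large, it can be worth comparing the local file against the MD5 value listed above before processing it. The following Python sketch reads the file in chunks so that it never has to fit in memory. The function name `md5_of_file` and the chunk size are illustrative choices, and the local path is an assumption about where the file was saved.

```python
import hashlib

# MD5 value as listed in this record.
EXPECTED_MD5 = "9f56f9de95892a1da24576ad513fe9ab"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed local path; adjust to where the file was downloaded.
    actual = md5_of_file("csTenTen17.vert.gz")
    print("OK" if actual == EXPECTED_MD5 else "MISMATCH: " + actual)
```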
