CEHugeWebCorpus

Name: CEHugeWebCorpus
License: http://creativecommons.org/licenses/by-nc-sa/4.0/

Rüdiger, Jan Oliver

dc.contributor.author	Rüdiger, Jan Oliver
dc.date.accessioned	2020-01-21T08:05:49Z
dc.date.available	2020-01-21T08:05:49Z
dc.date.issued	2018-03-01
dc.identifier.uri	http://hdl.handle.net/11372/LRT-2638
dc.description	This corpus was originally created for performance testing of the CorpusExplorer server infrastructure (see diskurslinguistik.net and diskursmonitor.de). It contains the German-language portion of the CommonCrawl data (as of March 2018). First, the URLs were filtered by top-level domain (de, at, ch). The texts were then classified with NTextCat, and only texts identified unambiguously as German were included in the corpus. Finally, the texts were annotated with TreeTagger (token, lemma, part of speech). Size: 2.58 million documents, 232.87 million sentences, 3.021 billion tokens. CorpusExplorer (http://hdl.handle.net/11234/1-2634) can convert this data into various other corpus formats (XML, JSON, WebLicht, TXM, and many more).
dc.language.iso	deu
dc.publisher	Rüdiger, Jan Oliver
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri	https://notes.jan-oliver-ruediger.de/korpora/
dc.subject	corpus
dc.subject	German
dc.subject	Germanistik
dc.subject	Web corpus
dc.subject	web corpora
dc.subject	CorpusExplorer
dc.title	CEHugeWebCorpus
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LRT + Open Submissions
contact.person	Jan Oliver Rüdiger, e-mail@jan-oliver-ruediger.de, University of Siegen
size.info	3021000000 tokens
files.size	15840988419
files.count	1
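
The first filtering step named in dc.description (keeping only URLs whose top-level domain is de, at, or ch) can be illustrated with a minimal sketch. This is not the author's code: the function name has_german_tld and the constant GERMAN_TLDS are hypothetical, and the later steps (NTextCat language classification, TreeTagger annotation) are not reproduced here.

```python
from urllib.parse import urlsplit

# Top-level domains listed in the record's description.
GERMAN_TLDS = ("de", "at", "ch")


def has_german_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of GERMAN_TLDS.

    Hypothetical helper for illustration only; the original pipeline's
    exact matching rules are not documented in this record.
    """
    try:
        host = urlsplit(url).hostname  # lowercased by urllib, None if absent
    except ValueError:
        # urlsplit raises ValueError for some malformed URLs.
        return False
    if not host:
        return False
    # Compare only the last dot-separated label of the host name.
    return host.rstrip(".").rsplit(".", 1)[-1] in GERMAN_TLDS


def filter_urls(urls):
    """Keep only the URLs that pass the top-level-domain check."""
    return [u for u in urls if has_german_tld(u)]
```

Matching on the last host label rather than on a substring avoids false positives such as a path segment named "de" on a .com domain.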

Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: CEHugeWebCorpus.zip
Size: 14.75 GB (15840988419 bytes)
Format: application/zip
Description: Corpora (CEC6-Format)
MD5: 5e60cd05aa408786372f03cc9733b4cf
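
Because the archive is large, it may be worth checking a download against the MD5 and size listed in this record before unpacking it. The following is a minimal sketch using only the Python standard library; it assumes that files.size is given in bytes (15840988419 bytes is about 14.75 when divided by 1024^3, which matches the displayed size). The function names md5_of_file and verify_download are hypothetical.

```python
import hashlib
from pathlib import Path

# Values copied from this record; EXPECTED_SIZE assumes files.size is in bytes.
EXPECTED_MD5 = "5e60cd05aa408786372f03cc9733b4cf"
EXPECTED_SIZE = 15840988419


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file has the expected size and MD5 checksum."""
    p = Path(path)
    # The size check is cheap, so do it before hashing the whole file.
    if p.stat().st_size != expected_size:
        return False
    return md5_of_file(p) == expected_md5.lower()


if __name__ == "__main__":
    # Assumes the archive was saved under its listed name in the current directory.
    print(verify_download("CEHugeWebCorpus.zip"))
```

MD5 is sufficient for detecting a truncated or corrupted transfer, which is the purpose here; it is not a safeguard against deliberate tampering.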
