Show simple item record

 
dc.contributor.author Rüdiger, Jan Oliver
dc.date.accessioned 2020-01-21T08:05:49Z
dc.date.available 2020-01-21T08:05:49Z
dc.date.issued 2018-03-01
dc.identifier.uri http://hdl.handle.net/11372/LRT-2638
dc.description This corpus was originally created for performance testing of the CorpusExplorer server infrastructure (see diskurslinguistik.net / diskursmonitor.de). It contains the German-only subset of CommonCrawl (as of March 2018). First, the URLs were filtered by top-level domain (de, at, ch). The texts were then classified with NTextCat, and only texts identified unambiguously as German were included in the corpus. Finally, the texts were annotated with TreeTagger (token, lemma, part of speech). Size: 2.58 million documents, 232.87 million sentences, 3.021 billion tokens. CorpusExplorer (http://hdl.handle.net/11234/1-2634) can convert this data into various other corpus formats (XML, JSON, WebLicht, TXM, and many more).
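The first filtering step in the description (keeping only URLs under the de, at, and ch top-level domains) could look roughly like the following. This is a minimal sketch and not the author's actual pipeline: the function names `has_german_tld` and `filter_urls` are hypothetical, and it assumes each URL includes a scheme such as `https://`. The later NTextCat and TreeTagger steps are not shown.

```python
from urllib.parse import urlsplit

# Top-level domains named in the record's description.
GERMAN_TLDS = {"de", "at", "ch"}


def has_german_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the listed TLDs."""
    # hostname is None when the URL has no network location (e.g. no scheme).
    host = urlsplit(url).hostname
    if not host:
        return False
    # Take the last dot-separated label of the host, ignoring a trailing dot.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in GERMAN_TLDS


def filter_urls(urls):
    """Keep only URLs whose host is under de, at, or ch (hypothetical helper)."""
    return [u for u in urls if has_german_tld(u)]
```

A path segment such as `/de/` on a `.com` host does not pass this filter, because only the host name is inspected.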
dc.language.iso deu
dc.publisher Rüdiger, Jan Oliver
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri https://notes.jan-oliver-ruediger.de/korpora/
dc.subject corpus
dc.subject German
dc.subject Germanistik
dc.subject Web corpus
dc.subject web corpora
dc.subject CorpusExplorer
dc.title CEHugeWebCorpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
contact.person Jan Oliver Rüdiger e-mail@jan-oliver-ruediger.de University of Siegen
size.info 3021000000 tokens
files.size 15840988419
files.count 1


 Files in this item

Name CEHugeWebCorpus.zip
Size 14.75 GB
Format application/zip
Description Corpora (CEC6-Format)
MD5 5e60cd05aa408786372f03cc9733b4cf
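
Because the archive is close to 15 GB, it can be worth checking a download against the MD5 value listed above. The following is a minimal sketch using Python's standard `hashlib`. The helper names and the local file path in the usage note are assumptions, not part of the record. The file is read in chunks so that it never has to fit in memory.

```python
import hashlib

# MD5 value as listed in this record for CEHugeWebCorpus.zip.
EXPECTED_MD5 = "5e60cd05aa408786372f03cc9733b4cf"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved under its listed name in the current directory, `matches_expected("CEHugeWebCorpus.zip")` should return True for an intact download.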