dc.contributor.author | Rüdiger, Jan Oliver |
dc.date.accessioned | 2020-01-21T08:05:49Z |
dc.date.available | 2020-01-21T08:05:49Z |
dc.date.issued | 2018-03-01 |
dc.identifier.uri | http://hdl.handle.net/11372/LRT-2638 |
dc.description | This corpus was originally created for performance testing of the CorpusExplorer server infrastructure (see: diskurslinguistik.net / diskursmonitor.de). It contains the German-only subset of CommonCrawl (as of March 2018). First, the URLs were filtered by top-level domain (de, at, ch). The texts were then classified using NTextCat, and only texts identified unambiguously as German were included in the corpus. Finally, the texts were annotated using TreeTagger (token, lemma, part-of-speech). Size: 2.58 million documents, 232.87 million sentences, 3.021 billion tokens. You can use CorpusExplorer (http://hdl.handle.net/11234/1-2634) to convert this data into various other corpus formats (XML, JSON, Weblicht, TXM and many more). |
dc.language.iso | deu |
dc.publisher | Rüdiger, Jan Oliver |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.source.uri | https://notes.jan-oliver-ruediger.de/korpora/ |
dc.subject | corpus |
dc.subject | German |
dc.subject | Germanistik |
dc.subject | Web corpus |
dc.subject | web corpora |
dc.subject | CorpusExplorer |
dc.title | CEHugeWebCorpus |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LRT + Open Submissions |
contact.person | Jan Oliver Rüdiger e-mail@jan-oliver-ruediger.de University of Siegen |
size.info | 3021000000 tokens |
files.size | 15840988419 |
files.count | 1 |
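The first filtering step named in the description (keeping only URLs under the de, at, and ch top-level domains) can be illustrated with a minimal Python sketch. This is not the code used to build the corpus; the function name `has_allowed_tld` and the constant `ALLOWED_TLDS` are hypothetical, and the sketch assumes that each URL carries a scheme such as `http://`.

```python
from urllib.parse import urlsplit

# Top-level domains named in the record's description.
ALLOWED_TLDS = {"de", "at", "ch"}


def has_allowed_tld(url, allowed=ALLOWED_TLDS):
    """Return True if the URL's host ends in one of the allowed TLDs.

    Hypothetical helper for illustration only. URLs without a scheme
    (and therefore without a parseable host) are rejected.
    """
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are rejected.
        return False
    if not host:
        return False
    # Drop a trailing root dot, then take the last dot-separated label.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in allowed


urls = [
    "http://www.example.de/seite",
    "https://example.com/page",
    "http://beispiel.at:8080/x",
]
german_area_urls = [u for u in urls if has_allowed_tld(u)]
```

A TLD filter of this kind only narrows the candidate set. According to the description, language identification with NTextCat was still applied afterwards to keep only German texts.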
Files in this item
License category:
Licence: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Publicly Available
- Name: CEHugeWebCorpus.zip
- Size: 14.75 GB
- Format: application/zip
- Description: Corpora (CEC6-Format)
- MD5: 5e60cd05aa408786372f03cc9733b4cf
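
Because the archive is large, it may be worth checking a download against the MD5 checksum listed in this record before unpacking it. The following is a minimal Python sketch using only the standard library; the function names `md5_of_file` and `verify_download` are hypothetical, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib

# MD5 checksum as listed in this record for CEHugeWebCorpus.zip.
EXPECTED_MD5 = "5e60cd05aa408786372f03cc9733b4cf"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value
    (the comparison ignores letter case)."""
    return md5_of_file(path).lower() == expected_md5.lower()
```

For example, `verify_download("CEHugeWebCorpus.zip")` would return `True` for an intact copy saved under that name in the current directory. MD5 is adequate here for detecting transfer corruption, but it is not a safeguard against deliberate tampering.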