Original context has metadata only: false / Rights: ACA - LINDAT/CLARIAH-CZ Catalog Search Results

1. Czech Web Corpus 2017 (csTenTen17)

Creator:: Suchomel, Vít
Publisher:: Masaryk University, NLP Centre and Lexical Computing CZ s.r.o.
Type:: text and corpus
Subject:: Web corpus
Language:: Czech
Description:: The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top-level domain ".cz". The data was crawled by the web crawler SpiderLing (https://corpus.tools/wiki/SpiderLing). The data was cleaned by removing boilerplate (using https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (by https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language. The corpus was POS annotated by the morphological analyser Majka using this POS tagset: https://www.sketchengine.eu/tagset-reference-for-czech/. Text sources: General web, Wikipedia. Time span of crawling: May, October and November 2017, October and November 2016, October and November 2015. The Czech Wikipedia part was downloaded in November 2017. Data format: Plain text, vertical (one token per line), gzip compressed. The vertical contains the following structures: documents (<doc/>, usually corresponding to web pages), paragraphs (<p/>), sentences (<s/>) and word join markers (<g/>, a "glue" tag indicating that there was no space between the surrounding tokens in the original text). Document metadata: src (the source of the data), title (the title of the web page), url (the URL of the document), crawl_date (the date of downloading the document). Paragraph metadata: heading ("1" if the paragraph is a heading, usually <h1> to <h6> elements in the original HTML data). Block elements in the case of an HTML source or double blank lines in the case of other source formats were used as paragraph separators. An internal heuristic tool was used to mark sentence breaks. The tab-separated positional attributes are: word form, morphological annotation, lem-POS (the base form of the word, i.e. the lemma, with a part of speech suffix) and gender respecting lemma (nouns and adjectives only). Please cite the following paper when using the corpus for your research: Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018. (https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119)
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA
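
The vertical format described above (one token per line, tab-separated attributes, <doc>, <p>, <s> and <g/> structures) can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the function names parse_vertical and detokenize, the sample text, and the placeholder tags TAG1 to TAG4 are illustrative and not part of the corpus distribution, and tokens outside a <s> element are simply skipped.

```python
import io
import re

# Structural lines look like <doc ...>, </doc>, <p ...>, <s>, </s> or <g/>.
TAG_RE = re.compile(r'^<(/?)(\w+)(\s.*?)?(/?)>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield one dict per <doc>: {"meta": {...}, "sentences": [[token, ...], ...]}.

    Each token is a dict with the word form, the remaining tab-separated
    columns, and a "glued" flag set when a <g/> marker preceded the token.
    """
    doc, sentence, glued = None, None, False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # Token lines contain tabs; structural lines do not.
        match = TAG_RE.match(line) if "\t" not in line else None
        if match:
            closing, name, rest = match.group(1), match.group(2), match.group(3) or ""
            if name == "doc" and not closing:
                doc = {"meta": dict(ATTR_RE.findall(rest)), "sentences": []}
            elif name == "doc" and closing:
                if doc is not None:
                    yield doc
                doc = None
            elif name == "s" and not closing:
                sentence = []
            elif name == "s" and closing:
                if doc is not None and sentence is not None:
                    doc["sentences"].append(sentence)
                sentence = None
            elif name == "g":
                glued = True
            continue
        cols = line.split("\t")
        token = {"word": cols[0], "attrs": cols[1:], "glued": glued}
        glued = False
        if sentence is not None:
            sentence.append(token)


def detokenize(sentence):
    """Rebuild running text, omitting the space before glued tokens."""
    parts = []
    for token in sentence:
        if parts and not token["glued"]:
            parts.append(" ")
        parts.append(token["word"])
    return "".join(parts)


# Hypothetical sample; TAG1..TAG4 stand in for real Majka annotations.
SAMPLE = (
    '<doc src="web" title="Example" url="http://example.cz/" crawl_date="2017-05-01">\n'
    '<p heading="1">\n'
    '<s>\n'
    'Ahoj\tTAG1\tahoj-x\n'
    '<g/>\n'
    ',\tTAG2\t,-x\n'
    'světe\tTAG3\tsvět-n\tsvět\n'
    '<g/>\n'
    '!\tTAG4\t!-x\n'
    '</s>\n'
    '</p>\n'
    '</doc>\n'
)

for document in parse_vertical(io.StringIO(SAMPLE)):
    print(document["meta"]["url"], [detokenize(s) for s in document["sentences"]])
```

For a real gzip-compressed file, the same generator could be fed from gzip.open(path, "rt", encoding="utf-8"); the UTF-8 encoding is an assumption here, since the record does not state it.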

2. enTenTen

Creator:: (:unav) Unknown author
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: English large corpus
Language:: English
Description:: Very large English web corpus enTenTen, comprising 3,268,798,627 tokens (Lexical Computing Ltd.).
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

3. Feature-based tagger

Creator:: Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService
Subject:: morphology and tagger
Description:: The Feature-based (exponential model) Tagger is a fast implementation of the Czech tagger developed at UFAL and described in the PDT 1.0 documentation (Czech Language Tagging page). In order to get the best possible results, the tagger requires preprocessing by a Czech morphological module with a very high coverage. This module covers a superset of the Czech "FM" morphology. Both the morphological module and the tagger are supplied as binary executables, together with all necessary precompiled Czech data. Input must be in the ISO Latin 2 (iso-8859-2) code and follow the csts.dtd definition, and output is produced in the same way (ISO Latin 2 code, csts.dtd). (As is the case with many of the tools provided with PDT 1.0, both executables also accept - and then produce - a "simplified SGML", which is not a real, valid SGML, but simply contains at least the tags for words, punctuation, and sentence breaks, one item per line.)
Rights:: PDT 2.0 License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2, and ACA
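
Because the tagger expects ISO Latin 2 (iso-8859-2) input, text stored in another encoding has to be recoded first. The sketch below shows one possible way to do this in Python; the helper names to_latin2 and unencodable_chars are hypothetical, and wrapping the result in csts.dtd markup is not covered.

```python
import unicodedata


def to_latin2(text):
    """Encode text as ISO-8859-2 bytes.

    The text is first normalized to NFC so that letters written as
    base character plus combining accent become single code points.
    Raises UnicodeEncodeError if a character has no Latin-2 equivalent.
    """
    return unicodedata.normalize("NFC", text).encode("iso-8859-2")


def unencodable_chars(text):
    """Return the sorted characters that cannot be represented in ISO-8859-2."""
    bad = set()
    for ch in unicodedata.normalize("NFC", text):
        try:
            ch.encode("iso-8859-2")
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


sample = "Příliš žluťoučký kůň"
encoded = to_latin2(sample)
print(len(sample), len(encoded))          # one byte per character in Latin-2
print(unencodable_chars("cena 5 €"))      # the euro sign is not in ISO-8859-2
```

Checking for unencodable characters before conversion avoids a hard failure on modern web text, which often contains typographic quotes or currency symbols outside Latin-2.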

4. HamleDT 2.0

Creator:: Zeman, Daniel, Mareček, David, Mašek, Jan, Popel, Martin, Ramasamy, Loganathan, Rosa, Rudolf, Štěpánek, Jan, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: treebank, Stanford dependencies, Prague dependencies, harmonization, common annotation style, and Interset
Language:: Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Modern Greek (1453-), English, Spanish, Estonian, Basque, Persian, Finnish, Ancient Greek (to 1453), Hindi, Hungarian, Italian, Japanese, Latin, Dutch, Portuguese, Romanian, Russian, Slovak, Slovenian, Swedish, Tamil, Telugu, and Turkish
Description:: HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular recently. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes.
Rights:: HamleDT 2.0 Licence Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-2.0, and ACA

5. Indonesian web corpus

Creator:: Medveď, Marek and Suchomel, Vít
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Web corpus
Language:: Indonesian
Description:: Indonesian web corpus crawled in 2010. Encoded in UTF-8, cleaned, deduplicated, tagged by MorphInd.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

6. Indonesian web corpus (idWac)

Creator:: Medveď, Marek and Suchomel, Vít
Publisher:: Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Type:: text and corpus
Subject:: corpus, lemmatization, and PoS tagging
Language:: Indonesian
Description:: Indonesian text corpus from web. Crawling done by SpiderLing in 2017. Filtering by JusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized by MorphInd (http://septinalarasati.com/morphind/).
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

7. Languages in Migration

Creator:: Bučková, Aneta, Nekula, Marek, Lukeš, David, Woźniak, Michał, Wastl, Michael, and Polowy, Louisa
Publisher:: Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague and Universität Regensburg
Type:: text and corpus
Subject:: spoken language, bilingual, syntactic annotation, migrant language, narrative interviews, and language biography
Language:: German and Czech
Description:: LANGUAGES IN MIGRATION is designed as a representation of authentic spoken Czech and German used in informal speech (private environment, spontaneity, unpreparedness etc.) by Czech-German bilingual speakers who were born in Czechoslovakia around 1955 and left for Germany after the age of 12. The corpus is composed of interviews on language biographies, conducted in 2018–2020 with 20 speakers and narrated in Czech and German respectively. 10 interviews were recorded with late (German) repatriates and 10 with Czech migrants. The corpus includes transcripts of ca. 14 hours of Czech recordings and ca. 13.5 hours of German recordings. It contains 217 650 orthographic words (i.e. a total of 286 533 tokens including punctuation). The metadata include basic sociolinguistically relevant speaker categories (gender, year of birth and of migration, level of education and region of childhood and present residence). The transcription is linked to the corresponding audio track. It was carried out on the orthographic tier and supplemented by an additional metalanguage tier. The corpus is lemmatized and morphologically tagged in different formats for Czech and German (Stuttgart-Tübingen-Tagset). Deviations from the norm of the spoken Czech and German of the homeland, which are understood as the result of language contact and language isolation, are tagged in a further tier in both the Czech and the German subcorpora. The (anonymized) corpus is provided in the form of transcripts in EAF format, which can be viewed in the freely available ELAN program, and in a (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

8. MORFO

Creator:: Kolovratník, David
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService
Subject:: morphological analysis
Language:: Czech
Description:: The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for handling dictionary objects.
Rights:: PDT 2.0 License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2, and ACA

9. ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio)

Creator:: Benešová, Lucie, Křen, Michal, and Waclawičová, Martina
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: audio and corpus
Subject:: balanced corpus, spoken language, and speech corpus
Language:: Czech
Description:: ORAL2013 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole Czech Republic. The corpus comprises 835 recordings from 2008–2011 that contain 2 785 189 words (i.e. 3 285 508 tokens including punctuation) uttered by 2 544 speakers, of whom 1 297 are unique. ORAL2013 is balanced in the main sociolinguistic categories of the speakers (gender, age group, education, region of childhood residence). The (anonymized) transcriptions are provided in the Transcriber XML format; audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-1847
Rights:: License Agreement for Czech National Corpus Data, https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc-data, and ACA
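
The stated audio format (uncompressed 16-bit PCM WAV, mono, 16 kHz) can be verified after download with Python's standard wave module. This is a small sketch only; the function name check_wav and its default parameters are illustrative assumptions, not part of the corpus tooling.

```python
import wave


def check_wav(path_or_file, rate=16000, channels=1, sample_width=2):
    """Compare a WAV file's header with the expected format.

    sample_width is given in bytes, so 2 corresponds to 16-bit samples.
    Returns (ok, found), where found lists the values read from the header.
    """
    with wave.open(path_or_file, "rb") as wav:
        found = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    ok = (
        found["channels"] == channels
        and found["sample_width_bytes"] == sample_width
        and found["frame_rate_hz"] == rate
    )
    return ok, found
```

The function accepts either a file path or an open binary file object, so it can be applied to every recording in a directory before further processing.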

10. ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions & audio)

Creator:: Kopřivová, Marie, Komrsková, Zuzana, Lukeš, David, Poukarová, Petra, and Škarpová, Marie
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: audio and corpus
Subject:: balanced corpus, spoken language, informal language, and Czech
Language:: Czech
Description:: ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole Czech Republic. The corpus is composed of 332 recordings from 2012–2017 and contains 1 014 786 orthographic words (i.e. a total of 1 236 508 tokens including punctuation); a total of 624 different speakers appear in the probes. ORTOFON v1 is fully balanced regarding the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v1 is lemmatized and morphologically tagged. The (anonymized) transcriptions are provided in the XML Elan Annotation format; audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-2580
Rights:: License Agreement for Czech National Corpus Data, https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc-data, and ACA
