ORAL2013 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole Czech Republic. The corpus comprises 835 recordings from 2008–2011 that contain 2 785 189 words (i.e. 3 285 508 tokens including punctuation) uttered by 2 544 speakers, out of which 1 297 speakers are unique. ORAL2013 is balanced in the main sociolinguistic categories of speakers (gender, age group, education, region of childhood residence).
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz
Please note: this item includes only the transcriptions, audio is available under more restrictive non-CC license at http://hdl.handle.net/11234/1-1848
ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole Czech Republic. The corpus is composed of 332 recordings from 2012–2017 and contains 1 014 786 orthographic words (i.e. a total of 1 236 508 tokens including punctuation); a total of 624 different speakers appear in the probes. ORTOFON v1 is fully balanced regarding the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence).
The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v1 is lemmatized and morphologically tagged. The (anonymized) corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz
Please note: this item includes only the transcriptions, audio (and the transcripts in their original format) is available under more restrictive non-CC license at http://hdl.handle.net/11234/1-2579
Opera singer Otakar Mařák as a patient at Na Bulovce Hospital in Prague as included in his obituary in a segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1939, issue no. 28. The footage with the artist was shot in spring 1938.
Music critic Otakar Šourek on Bohumil Veselý's balcony. Otakar Šourek at Slavín Cemetery in Vyšehrad in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1954, issue no. 20.
Sculptor Otakar Švec with his colleagues František Žemlička and Arnošt Košík working on a model for the Stalin Monument in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1953, issue no. 22.
Director Otakar Vávra with cinematographer Jan Roth during the shooting of Humoreska (Humoresque, dir. Otakara Vávra, 1939) in a segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1939, issue no. 41B.
The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce“. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to corresponding segments. The resulting corpus is suitable for both acoustic model training for ASR purposes and training of speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in XML-based standard Transcriber format (http://trans.sourceforge.net)
This package comprises eight models of Czech word embeddings trained by applying word2vec (Mikolov et al. 2013) to the currently most extensive corpus of Czech, namely SYN v9 (Křen et al. 2022). The minimum frequency threshold for including a word in the model was 10 occurrences in the corpus. The original lemmatisation and tagging included in the corpus were used for disambiguation. In the case of word embeddings of word forms, units comprise word forms and their tag from a positional tagset (cf. https://wiki.korpus.cz/doku.php/en:pojmy:tag) separated by '>', e.g., kočka>NNFS1-----A----.
The published package provides models trained on both tokens and lemmas. In addition, the models combine training algorithms (CBOW and Skipgram) and dimensions of the resulting vectors (100 or 500), while the training window and negative sampling remained the same during the training. The package also includes files with frequencies of word forms (vocab-frequencies.forms) and lemmas (vocab-frequencies.lemmas).
The January 2018 release of the ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of web sites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and raw data download please visit: http://paracrawl.eu/releases.html