ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (transcriptions)

Name: ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (transcriptions)
License: http://creativecommons.org/licenses/by-nc-sa/4.0/

Lukeš, David; Kopřivová, Marie; Laubeová, Zuzana; Poukarová, Petra; Horký, Václav; Jelínek, Tomáš; Křivan, Jan; Waclawičová, Martina; Benešová, Lucie; Škarpová, Marie

dc.contributor.author	Lukeš, David
dc.contributor.author	Kopřivová, Marie
dc.contributor.author	Laubeová, Zuzana
dc.contributor.author	Poukarová, Petra
dc.contributor.author	Horký, Václav
dc.contributor.author	Jelínek, Tomáš
dc.contributor.author	Křivan, Jan
dc.contributor.author	Waclawičová, Martina
dc.contributor.author	Benešová, Lucie
dc.contributor.author	Škarpová, Marie
dc.date.accessioned	2024-10-10T10:40:18Z
dc.date.available	2024-10-10T10:40:18Z
dc.date.issued	2024-07-15
dc.identifier.uri	http://hdl.handle.net/11234/1-5687
dc.description	ORTOFON v3 is a corpus of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness, etc.) that covers the whole of the Czech Republic. The corpus is composed of 697 recordings from 2012–2020 and contains 2 445 793 orthographic words (i.e. a total of 2 976 742 tokens including punctuation); a total of 1 121 different speakers appear in the probes. ORTOFON v3 is partially balanced with regard to the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike in the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The tagging was performed with special attention to the specifics of informal spoken Czech and also made use of spoken training data. The (anonymized) corpus is provided in a (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz. Please note: this item includes only the transcriptions; the audio (and the transcripts in their original format) are available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-5686
dc.language.iso	ces
dc.publisher	Charles University, Faculty of Arts, Institute of the Czech National Corpus
dc.relation.replaces	http://hdl.handle.net/11234/1-2580
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri	https://wiki.korpus.cz/doku.php/en:cnk:ortofon
dc.subject	spoken language
dc.subject	informal language
dc.title	ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (transcriptions)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
demo.uri	https://www.korpus.cz/kontext/query?corpname=ortofon_v3
contact.person	Michal Křen michal.kren@ff.cuni.cz Charles University, Faculty of Arts, Institute of the Czech National Corpus
sponsor	Ministerstvo školství, mládeže a tělovýchovy LM2023044 Český národní korpus nationalFunds
size.info	2400000 words
files.size	38769743
files.count	1
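
The description states that the corpus is distributed in a (semi-XML) vertical format. The following is a minimal Python sketch of how such a file might be read, assuming the usual vertical convention of one token per line with tab-separated attributes and structural markup on separate lines that start with "<" and end with ">". The UTF-8 encoding, the function names and the column layout are assumptions made for illustration; the record itself does not document them, so check the corpus documentation before relying on any column position.

```python
import gzip
from typing import Iterable, Iterator, List


def iter_tokens(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the tab-separated attributes of each token line.

    Assumption: any line that starts with "<" and ends with ">" is
    structural markup (not a token) and is skipped.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        if line.startswith("<") and line.endswith(">"):
            continue  # structural markup line, not a token
        yield line.split("\t")


def count_tokens(path: str) -> int:
    """Count token lines in a gzip-compressed vertical file.

    Assumption: the file is UTF-8 encoded.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in iter_tokens(handle))
```

Calling count_tokens("ortofon_v3_vert.gz") on the downloaded file would stream through it without loading it into memory. Whether the result matches the token count given in the description depends on how the assumptions above hold for the actual data.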

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: ortofon_v3_vert.gz
Size: 36.97 MB
Format: application/x-gzip
Description: the data
MD5: ab3f38428013d5e3f12982423302584d
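
The MD5 value listed above can be used to check that a download arrived intact. The following is a minimal Python sketch using only the standard library; the function names and the local file path are hypothetical, and the expected digest is copied from this record.

```python
import hashlib
from typing import BinaryIO

# Digest copied from the file listing in this record.
EXPECTED_MD5 = "ab3f38428013d5e3f12982423302584d"


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file at `path` has the expected MD5 digest."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected.lower()
```

For example, verify_download("ortofon_v3_vert.gz") should return True for an uncorrupted copy saved under that name in the current directory. MD5 is adequate here for detecting transfer errors, not for guarding against deliberate tampering.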
