Original context has metadata only: false / Rights: ACA - LINDAT/CLARIAH-CZ Catalog Search Results

11. On-line Dictionary of medieval latin in the Czech lands

Creator:: Ctibor, Jan and Nývlt, Pavel
Publisher:: Institute of Philosophy of the Czech Academy of Sciences
Type:: text, lexicon, and lexicalConceptualResource
Subject:: dictionary, latin, Medieval, digital humanities, lexicography, and Medieval Latin
Language:: Latin and Czech
Description:: The Dictionary of Medieval Latin in the Czech Lands registers and explains the vocabulary of Medieval Latin as used in the Czech lands from the beginnings of Latin writing in this area (about 1000 CE) to 1500 CE, so far covering the letters A-M. For more information about the Dictionary, see the webpage of the Department of Medieval Lexicography of the Institute of Philosophy of the Czech Academy of Sciences. The uploaded data comprise the on-line version of the dictionary (API and XML data), which makes it possible to run the application on localhost.
Rights:: Dictionary of Medieval Latin in the Czech Lands - digital version 2.2 License Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/license-lb, and ACA

12. ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio)

Creator:: Benešová, Lucie, Křen, Michal, and Waclawičová, Martina
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: audio and corpus
Subject:: balanced corpus, spoken language, and speech corpus
Language:: Czech
Description:: ORAL2013 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) across the whole Czech Republic. The corpus comprises 835 recordings from 2008–2011 that contain 2 785 189 words (i.e. 3 285 508 tokens including punctuation) uttered by 2 544 speakers, of whom 1 297 are unique. ORAL2013 is balanced in the main sociolinguistic categories of the speakers (gender, age group, education, region of childhood residence). The (anonymized) transcriptions are provided in the Transcriber XML format; the audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-1847
Rights:: License Agreement for Czech National Corpus Data, https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc-data, and ACA
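Note: the audio format stated above (uncompressed 16-bit PCM WAV, mono, 16 kHz) can be checked on a downloaded file with Python's standard `wave` module. The following is a minimal sketch only; the function name `matches_described_format` and any file paths are hypothetical and are not part of the corpus distribution.

```python
import wave


def matches_described_format(path):
    """Return True if the WAV file is uncompressed 16-bit PCM, mono, 16 kHz.

    Hypothetical helper for illustration; not shipped with the corpus.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getsampwidth() == 2           # 16-bit samples are 2 bytes wide
            and wav.getnchannels() == 1       # mono
            and wav.getframerate() == 16000   # 16 kHz sampling rate
            and wav.getcomptype() == "NONE"   # uncompressed PCM
        )
```

A file that fails the check would need resampling or channel mixing before being processed alongside the corpus audio.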

13. Oromo web corpus

Creator:: Suchomel, Vít and Rychlý, Pavel
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: text corpora, Ethiopian languages, Oromo, Web corpus, and under-resourced language
Language:: Oromo
Description:: Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

14. ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions & audio)

Creator:: Kopřivová, Marie, Komrsková, Zuzana, Lukeš, David, Poukarová, Petra, and Škarpová, Marie
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: audio and corpus
Subject:: balanced corpus, spoken language, informal language, and Czech
Language:: Czech
Description:: ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) across the whole Czech Republic. The corpus is composed of 332 recordings from 2012–2017 and contains 1 014 786 orthographic words (i.e. a total of 1 236 508 tokens including punctuation); a total of 624 different speakers appear in the probes. ORTOFON v1 is fully balanced regarding the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v1 is lemmatized and morphologically tagged. The (anonymized) transcriptions are provided in the XML Elan Annotation format; the audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-2580
Rights:: License Agreement for Czech National Corpus Data, https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc-data, and ACA

15. ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (transcriptions & audio)

Creator:: Lukeš, David, Kopřivová, Marie, Laubeová, Zuzana, Poukarová, Petra, Horký, Václav, Jelínek, Tomáš, Křivan, Jan, Waclawičová, Martina, Benešová, Lucie, and Škarpová, Marie
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: audio and corpus
Subject:: spoken language and informal language
Language:: Czech
Description:: ORTOFON v3 is a corpus of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) that covers the whole Czech Republic. The corpus is composed of 697 recordings from 2012–2020 and contains 2 445 793 orthographic words (i.e. a total of 2 976 742 tokens including punctuation); a total of 1 121 different speakers appear in the probes. ORTOFON v3 is partially balanced regarding the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. The (anonymized) transcriptions are provided in the XML Elan Annotation format; the audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-5687
Rights:: License Agreement for Czech National Corpus Data, https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc-data, and ACA

16. Prague Dependency Treebank 2.0 (PDT 2.0)

Creator:: Hajič, Jan, Panevová, Jarmila, Hajičová, Eva, Sgall, Petr, Pajas, Petr, Štěpánek, Jan, Havelka, Jiří, Mikulová, Marie, Žabokrtský, Zdeněk, Ševčíková-Razímová, Magda, and Urešová, Zdeňka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, Czech, treebank, and PDT
Language:: Czech
Description:: The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (two million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well. Grants: 1ET101120413 (Data a nástroje pro informační systémy); MSM 0021620838 (Moderní metody, struktury a systémy informatiky); 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů); 1P05ME752 (Vícejazyčný valenční a predikátový slovník přirozeného jazyka); LC536 (Centrum komputační lingvistiky)
Rights:: PDT 2.0 License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2, and ACA

17. Prague Dependency Treebank of Spoken Language (PDTSL) 0.5

Creator:: Hajič, Jan, Pajas, Petr, Mareček, David, Mikulová, Marie, Urešová, Zdeňka, and Podveský, Petr
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: corpus and spoken language
Language:: Czech and English
Description:: The first edition of a speech corpus with a speech reconstruction layer (edited transcript). The project of speech reconstruction of Czech and English was started at UFAL together with the PIRE project in 2005, and has gradually grown from ideas to (first) annotation specification, annotation software and actual annotation. It is part of the Prague Dependency Treebank family of annotated corpus resources and tools, to which it adds the spoken language layer(s). Grants: LC536; MSM0021620838; IST-034344; ME838
Rights:: PDTSL, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-pdtsl, and ACA

18. Somali Web Corpus

Creator:: Suchomel, Vít and Rychlý, Pavel
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: text corpora, Ethiopian languages, web corpora, under-resourced languages, and Somali
Language:: Somali
Description:: Somali web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

19. SYN v4: large corpus of written Czech

Creator:: Křen, Michal, Cvrček, Václav, Čapka, Tomáš, Čermáková, Anna, Hnátková, Milena, Chlumská, Lucie, Jelínek, Tomáš, Kováříková, Dominika, Petkevič, Vladimír, Procházka, Pavel, Skoumalová, Hana, Škrabal, Michal, Truneček, Petr, Vondřička, Pavel, and Zasina, Adrian
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary written (printed) Czech sized 3.6 GW (i.e. 4.3 billion tokens). It covers mostly the period of 1990–2014 and it is a traditional corpus (as opposed to the web-crawled corpora) with rich metadata containing bibliographical information etc. Although it contains a wide range of text types (fiction, non-fiction, newspapers), the newspapers prevail noticeably. The corpus is lemmatized and morphologically annotated by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to registered users of the CNC at http://www.korpus.cz with one important exception: the corpus is shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) with ordering randomized within the given document.
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

20. SYN v9: large corpus of written Czech

Creator:: Křen, Michal, Cvrček, Václav, Henyš, Jan, Hnátková, Milena, Jelínek, Tomáš, Kocek, Jan, Kováříková, Dominika, Křivan, Jan, Milička, Jiří, Petkevič, Vladimír, Procházka, Pavel, Skoumalová, Hana, Šindlerová, Jana, and Škrabal, Michal
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary written (printed) Czech sized 4.7 GW (i.e. 5.7 billion tokens). It covers mostly the 1990–2019 period and features rich metadata including detailed bibliographical information, text-type classification etc. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but the newspapers prevail noticeably. The corpus is lemmatized and morphologically tagged by the new CNC tagset first utilized for the annotation of the SYN2020 corpus. SYN v9 is provided in a CoNLL-U-like vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to the registered users of CNC at http://www.korpus.cz with one important exception: the corpus is shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) with ordering randomized within the given document.
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA
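Note: the shuffling described for the SYN corpora (blocks of at most 100 words that respect sentence boundaries, with block order randomized within each document) can be pictured roughly as below. This is an illustrative sketch under stated assumptions, not the procedure actually used by the CNC: the function name `shuffle_document`, the representation of a document as a list of token lists, counting every token as a word, and letting an over-long sentence form its own block are all assumptions.

```python
import random


def shuffle_document(sentences, max_words=100, seed=None):
    """Group sentences into blocks of at most max_words, then shuffle the blocks.

    sentences: list of sentences, each a list of tokens (assumed layout).
    Sentences are never split; a single sentence longer than max_words
    becomes a block of its own (an assumption of this sketch).
    """
    blocks, current, count = [], [], 0
    for sentence in sentences:
        # Close the current block if adding this sentence would exceed the limit.
        if current and count + len(sentence) > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += len(sentence)
    if current:
        blocks.append(current)
    # Randomize block order within this one document only.
    random.Random(seed).shuffle(blocks)
    return blocks
```

Because whole sentences are kept intact inside each block, sentence-level queries remain meaningful, while the original running order of the document cannot be read off the data.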
