Search Results
1152. OVM – Otázky Václava Moravce
- Creator:
- Šmídl, Luboš and Pražák, Aleš
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- audio and corpus
- Subject:
- speech corpus, acoustic model, speaker identification, and speaker verification
- Language:
- Czech
- Description:
- The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to the corresponding segments. The resulting corpus is suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net).
- Rights:
- Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB
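Because the OVM transcriptions assign speaker names to segments, per-speaker statistics can be read directly from the XML. The following is a minimal sketch, assuming the files follow the usual Transcriber layout (`Speaker` elements with `id` and `name`, and `Turn` elements with `speaker`, `startTime` and `endTime` attributes). The sample transcript, the speaker names and the helper `speaking_time` are invented for illustration and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature transcript in Transcriber-style XML; the speaker
# names, times and text are invented for illustration only.
SAMPLE = """<Trans>
  <Speakers>
    <Speaker id="spk1" name="Moderator"/>
    <Speaker id="spk2" name="Guest"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="12.5">
      <Turn speaker="spk1" startTime="0" endTime="4.5"><Sync time="0"/>dobrý den</Turn>
      <Turn speaker="spk2" startTime="4.5" endTime="12.5"><Sync time="4.5"/>děkuji za pozvání</Turn>
    </Section>
  </Episode>
</Trans>"""


def speaking_time(xml_text):
    """Return total turn duration in seconds per speaker name."""
    root = ET.fromstring(xml_text)
    # Map speaker ids to human-readable names.
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    totals = defaultdict(float)
    for turn in root.iter("Turn"):
        duration = float(turn.get("endTime")) - float(turn.get("startTime"))
        # Assumption: a Turn may list several space-separated speaker ids
        # (overlapping speech); each listed speaker gets the full duration.
        for spk in (turn.get("speaker") or "").split():
            totals[names.get(spk, spk)] += duration
    return dict(totals)


result = speaking_time(SAMPLE)
print(result)
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring`; summing the totals gives a quick sanity check against the stated 35 hours.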
1153. Package of word embeddings of Czech from a large corpus
- Creator:
- Kyjánek, Lukáš and Bonami, Olivier
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, computationalLexicon, and lexicalConceptualResource
- Subject:
- word embeddings, word vectors, large corpus, word2vec, skipgram, and cbow
- Language:
- Czech
- Description:
- This package comprises eight models of Czech word embeddings trained by applying word2vec (Mikolov et al. 2013) to the currently most extensive corpus of Czech, namely SYN v9 (Křen et al. 2022). The minimum frequency threshold for including a word in the model was 10 occurrences in the corpus. The original lemmatisation and tagging included in the corpus were used for disambiguation. In the case of word embeddings of word forms, units comprise word forms and their tag from a positional tagset (cf. https://wiki.korpus.cz/doku.php/en:pojmy:tag) separated by '>', e.g., kočka>NNFS1-----A----. The published package provides models trained on both tokens and lemmas. In addition, the models combine training algorithms (CBOW and Skipgram) and dimensions of the resulting vectors (100 or 500), while the training window and negative sampling remained the same during the training. The package also includes files with frequencies of word forms (vocab-frequencies.forms) and lemmas (vocab-frequencies.lemmas).
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
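The word-form models key each vector by a word form and its positional tag joined with '>', as in the example kočka>NNFS1-----A---- given in the description. A minimal sketch of splitting such a unit back into its two parts follows. The helper name `split_unit` is hypothetical, and splitting on the last '>' rests on the assumption that tags themselves never contain that character.

```python
def split_unit(unit, sep=">"):
    """Split a 'form>tag' unit into (form, tag).

    Splits on the LAST separator so that a word form containing '>' is
    kept whole (assumption: the tag itself never contains '>').
    Returns (unit, None) when no separator is present, e.g. for lemma units.
    """
    form, found, tag = unit.rpartition(sep)
    if not found:
        return unit, None
    return form, tag


form, tag = split_unit("kočka>NNFS1-----A----")
print(form, tag)
```

The same function can be applied to the keys of a loaded model or to the entries of vocab-frequencies.forms, provided those files use the same unit format.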
1154. ParaCrawl Corpus version 1.0
- Creator:
- Koehn, Philipp, Heafield, Kenneth, Forcada, Mikel L., Esplà-Gomis, Miquel, Ortiz-Rojas, Sergio, Sánchez, Gema Ramírez, Cartagena, Víctor M. Sánchez, Haddow, Barry, Bañón, Marta, Střelec, Marek, Samiotou, Anna, and Kamran, Amir
- Publisher:
- ParaCrawl
- Type:
- text and corpus
- Subject:
- ParaCrawl, parallel corpus, CommonCrawl, machine translation, and text corpora
- Language:
- English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Finnish, Latvian, Russian, and Estonian
- Description:
- The January 2018 release of the ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of web sites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and raw data download please visit: http://paracrawl.eu/releases.html
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
1155. ParaDi 2.0
- Creator:
- Barančíková, Petra and Kettnerová, Václava
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, machineReadableDictionary, and lexicalConceptualResource
- Subject:
- multiword expressions, light verb construction, paraphrases, and idioms
- Language:
- Czech
- Description:
- ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions - light verb constructions and idiomatic verb constructions. Moreover, it provides an elaborate set of morphological, syntactic and semantic features, including information on aspectual counterparts of verbs and paraphrasability conditions of given verbs. The format of ParaDi has been designed with respect to both human and machine readability - the dictionary is represented as a plain table in TSV format, as it is a flexible and language-independent data format.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
1156. ParaDi 2.0 (2018-01-24)
- Creator:
- Barančíková, Petra and Kettnerová, Václava
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, machineReadableDictionary, and lexicalConceptualResource
- Subject:
- multiword expressions, light verb construction, paraphrases, and idioms
- Language:
- Czech
- Description:
- ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions - light verb constructions and idiomatic verb constructions. Moreover, it provides an elaborate set of morphological, syntactic and semantic features, including information on aspectual counterparts of verbs and paraphrasability conditions of given verbs. The format of ParaDi has been designed with respect to both human and machine readability - the dictionary is represented as a plain table in TSV format, as it is a flexible and language-independent data format.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
1157. ParaDi: Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs
- Creator:
- Barančíková, Petra and Kettnerová, Václava
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, machineReadableDictionary, and lexicalConceptualResource
- Subject:
- light verb construction and paraphrases
- Language:
- Czech
- Description:
- Dictionary of single-verb paraphrases of Czech light verb constructions.
- Rights:
- Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB
1158. Parallel Global Voices, Czech-English NER+NEL
- Creator:
- Nevěřilová, Zuzana and Žižková, Hana
- Publisher:
- Masaryk University, Brno
- Type:
- text, other, and lexicalConceptualResource
- Subject:
- named entity recognition, named entities, named entity, named entity corpus, named entity linking, named entity disambiguation, and wikidata
- Language:
- English and Czech
- Description:
- Named entity annotations added to the existing Parallel Global Voices resource for the ces-eng language pair. The annotations distinguish four classes: Person, Organization, Location, and Misc. They follow the IOB scheme (one tag per token, marking the beginning and the inside of multi-word entities). The NEL annotation contains Wikidata Qnames.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
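The IOB scheme mentioned in the description tags every token as the beginning (B) of an entity, inside (I) an entity, or outside (O) any entity. A minimal sketch of turning such per-token tags back into entity spans is shown below. The tag strings (`B-PER`, `I-PER`, `B-LOC`), the sample sentence and the helper `iob_to_spans` are assumptions for illustration; the label names actually used in the corpus files may differ.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (label, text) entity spans."""
    spans, label, start = [], None, None
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        # Close the open entity unless this token continues it.
        if label is not None and not continues:
            spans.append((label, " ".join(tokens[start:i])))
            label = None
        # Open a new entity on B-, or leniently on an I- with nothing open.
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    return spans


# Invented example sentence with hypothetical label names.
tokens = ["Václav", "Havel", "visited", "Brno", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

Treating a stray I- tag as the start of a new entity is a design choice that tolerates annotation noise; a stricter reader could raise an error instead.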
1159. ParCorFull: A Parallel Corpus Annotated with Full Coreference
- Creator:
- Lapshinova-Koltunski, Ekaterina, Hardmeier, Christian, and Krielke, Pauline
- Publisher:
- Universität des Saarlandes and Uppsala University
- Type:
- text and corpus
- Subject:
- parallel corpus, annotated corpus, coreference, and anaphora resolution
- Language:
- English and German
- Description:
- ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual natural language processing (NLP) technologies face -- translation of coreference across languages. Our corpus contains parallel texts for the language pair English-German, two major European languages. Despite being typologically very close, these languages still have systemic differences in the realisation of coreference, and thus pose problems for multilingual coreference resolution and machine translation. Our parallel corpus covers the genres of planned speech (public lectures) and newswire. It is richly annotated for coreference in both languages, including annotation of both nominal coreference and reference to antecedents expressed as clauses, sentences and verb phrases. This resource supports research in the areas of natural language processing, contrastive linguistics and translation studies on the mechanisms involved in coreference translation in order to develop a better understanding of the phenomenon.
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
1160. ParCzech 3.0
- Creator:
- Kopp, Matyáš, Stankov, Vladislav, Bojar, Ondřej, Hladká, Barbora, and Straňák, Pavel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- Parliament of the Czech Republic, Chamber of Deputies, stenographic protocols, TEI encoding, and speech corpus
- Language:
- Czech
- Description:
- The ParCzech 3.0 corpus is the third version of ParCzech consisting of stenographic protocols that record the Chamber of Deputies’ meetings held in the 7th term (2013-2017) and the current 8th term (2017-Mar 2021). The protocols are provided in their original HTML format, Parla-CLARIN TEI format, and the format suitable for Automatic Speech Recognition. The corpus is automatically enriched with the morphological, syntactic, and named-entity annotations using the procedures UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB