Search Results
1622. OVM – Otázky Václava Moravce
- Creator:
- Šmídl, Luboš and Pražák, Aleš
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- audio and corpus
- Subject:
- speech corpus, acoustic model, speaker identification, and speaker verification
- Language:
- Czech
- Description:
- The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to the corresponding segments. The resulting corpus is suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in the XML-based Transcriber format (http://trans.sourceforge.net).
- Rights:
- Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB
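A minimal sketch of how a downloaded file could be checked against the audio format stated in the description above (WAV PCM, 16-bit, 48 kHz, mono), using Python's standard `wave` module. The helper names `describe_wav` and `matches_description` are hypothetical, and the demo runs on one second of generated silence rather than on a file from the corpus.

```python
import io
import wave

# Audio parameters as stated in the corpus description.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 48000}


def describe_wav(source):
    """Read basic header parameters from a WAV path or binary file object."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_description(info):
    """Return True if the header matches mono, 16-bit, 48 kHz audio."""
    return all(info[key] == value for key, value in EXPECTED.items())


# Self-contained demo: build one second of silence in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)       # mono
    out.setsampwidth(2)       # 2 bytes per sample = 16-bit
    out.setframerate(48000)   # 48 kHz
    out.writeframes(b"\x00\x00" * 48000)
buffer.seek(0)

info = describe_wav(buffer)
print(info, matches_description(info))
```

For a real recording, a file path can be passed to `describe_wav` in place of the in-memory buffer.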
1623. Oxford Text Archive
- Type:
- corpus
- Language:
- English
- Description:
- Electronic texts, corpora, lexicons, and other resources.
- Rights:
- Not specified
1624. Package of word embeddings of Czech from a large corpus
- Creator:
- Kyjánek, Lukáš and Bonami, Olivier
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, computationalLexicon, and lexicalConceptualResource
- Subject:
- word embeddings, word vectors, large corpus, word2vec, skipgram, and cbow
- Language:
- Czech
- Description:
- This package comprises eight models of Czech word embeddings trained by applying word2vec (Mikolov et al. 2013) to the currently most extensive corpus of Czech, namely SYN v9 (Křen et al. 2022). The minimum frequency threshold for including a word in the model was 10 occurrences in the corpus. The original lemmatisation and tagging included in the corpus were used for disambiguation. In the case of word embeddings of word forms, units comprise word forms and their tag from a positional tagset (cf. https://wiki.korpus.cz/doku.php/en:pojmy:tag) separated by '>', e.g., kočka>NNFS1-----A----. The published package provides models trained on both tokens and lemmas. In addition, the models combine training algorithms (CBOW and Skipgram) and dimensions of the resulting vectors (100 or 500), while the training window and negative sampling remained the same during the training. The package also includes files with frequencies of word forms (vocab-frequencies.forms) and lemmas (vocab-frequencies.lemmas).
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
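The unit format and model combinations described above can be illustrated with a short sketch. It splits a word-form unit such as `kočka>NNFS1-----A----` into form and tag, and enumerates the 2 × 2 × 2 combinations that give eight models. The names `Unit`, `split_unit`, and `MODEL_GRID` are hypothetical, the grid labels are not the package's actual file names, and splitting on the last `>` is an assumption made to keep any `>` inside a word form intact.

```python
from itertools import product
from typing import NamedTuple


class Unit(NamedTuple):
    form: str
    tag: str


def split_unit(unit: str) -> Unit:
    """Split a 'form>tag' vocabulary unit into its two parts.

    Assumption: the tag never contains '>', so the split is made on the
    last '>' in the string.
    """
    form, separator, tag = unit.rpartition(">")
    if not separator:
        raise ValueError(f"no '>' separator in unit: {unit!r}")
    return Unit(form, tag)


# Combinations named in the description: unit type x algorithm x dimension.
MODEL_GRID = list(product(("forms", "lemmas"), ("cbow", "skipgram"), (100, 500)))

unit = split_unit("kočka>NNFS1-----A----")
print(unit.form, unit.tag, len(unit.tag))
print(len(MODEL_GRID), "model configurations")
```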
1625. PALIC
- Publisher:
- Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
- Type:
- toolService
- Language:
- Catalan, French, Portuguese, and Spanish
- Description:
- A package of tools for processing the Corpus Tècnic in Catalan and Spanish. It includes a preprocessor, a PoS tagger, and a linguistic disambiguator.
- Rights:
- Not specified
1626. panacea_conversor
- Publisher:
- Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
- Type:
- toolService
- Description:
- Format conversion service: Panacea conversion tool
- Rights:
- Not specified
1627. ParaCrawl Corpus version 1.0
- Creator:
- Koehn, Philipp, Heafield, Kenneth, Forcada, Mikel L., Esplà-Gomis, Miquel, Ortiz-Rojas, Sergio, Sánchez, Gema Ramírez, Cartagena, Víctor M. Sánchez, Haddow, Barry, Bañón, Marta, Střelec, Marek, Samiotou, Anna, and Kamran, Amir
- Publisher:
- ParaCrawl
- Type:
- text and corpus
- Subject:
- ParaCrawl, parallel corpus, CommonCrawl, machine translation, and text corpora
- Language:
- English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Finnish, Latvian, Russian, and Estonian
- Description:
- The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that has much higher coverage of the selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and raw data download, please visit: http://paracrawl.eu/releases.html
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
1628. ParaDi 2.0
- Creator:
- Barančíková, Petra and Kettnerová, Václava
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, machineReadableDictionary, and lexicalConceptualResource
- Subject:
- multiword expressions, light verb construction, paraphrases, and idioms
- Language:
- Czech
- Description:
- ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions: light verb constructions and idiomatic verb constructions. It also provides an elaborate set of morphological, syntactic, and semantic features, including information on aspectual counterparts of verbs and paraphrasability conditions of given verbs. The format of ParaDi has been designed for both human and machine readability: the dictionary is represented as a plain table in TSV format, a flexible and language-independent data format.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
1629. ParaDi 2.0 (2018-01-24)
- Creator:
- Barančíková, Petra and Kettnerová, Václava
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, machineReadableDictionary, and lexicalConceptualResource
- Subject:
- multiword expressions, light verb construction, paraphrases, and idioms
- Language:
- Czech
- Description:
- ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions: light verb constructions and idiomatic verb constructions. It also provides an elaborate set of morphological, syntactic, and semantic features, including information on aspectual counterparts of verbs and paraphrasability conditions of given verbs. The format of ParaDi has been designed for both human and machine readability: the dictionary is represented as a plain table in TSV format, a flexible and language-independent data format.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
1630. ParaDi: Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs
- Creator:
- Barančíková, Petra and Kettnerová, Václava
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, machineReadableDictionary, and lexicalConceptualResource
- Subject:
- light verb construction and paraphrases
- Language:
- Czech
- Description:
- A dictionary of single-verb paraphrases of Czech light verb constructions.
- Rights:
- Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB