Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

841. SQAD v2

Creator:: Medveď, Marek, Horák, Aleš, and Šulganová, Terézia
Publisher:: Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Type:: text and corpus
Subject:: question answering, Czech, and Simple Question Answering Database
Language:: Czech
Description:: Simple Question Answering Database (SQAD) created from Czech Wikipedia. Each SQAD record consists of four files (in vertical format, with lemmatization and POS tagging) and two metadata files.
Rights:: GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB
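
The vertical format mentioned in the SQAD v2 description is commonly a one-token-per-line layout with tab-separated attributes. The following is a minimal sketch of reading such a file, assuming the column order word, lemma, tag and XML-like structural lines such as <s> and </s>. The actual SQAD column layout, the sample tags, and the names Token and parse_vertical are assumptions for illustration and are not taken from the dataset documentation.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str


def parse_vertical(lines: Iterable[str]) -> List[List[Token]]:
    """Group tokens into sentences; assumes word<TAB>lemma<TAB>tag columns."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("<") and line.endswith(">"):
            # Structural markup line; a closing </s> ends the current sentence.
            if line == "</s>" and current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip lines lacking the three assumed columns
        current.append(Token(parts[0], parts[1], parts[2]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; the tags are placeholders, not real SQAD tags.
sample = ["<s>\n", "Praha\tPraha\tk1\n", "je\tbýt\tk5\n", "</s>\n"]
sentences = parse_vertical(sample)
```

If the real files use a different column order or additional attributes, only the line that builds Token needs to change.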

842. Sri Lanka Malay corpus

Publisher:: Universiteit van Amsterdam
Type:: corpus
Description:: Documentation of the Sri Lanka Malay project (DoBeS project)
Rights:: Code of conduct

843. SrpRec - Serbian morphological electronic dictionary

Type:: lexicalConceptualResource
Language:: Serbian
Description:: Approximately 83,000 lemmata and approximately 1,200,000 word forms, in LADL format.
Rights:: Not specified

844. St Andrews corpus of Ancient Egyptian

Publisher:: University of St. Andrews
Type:: corpus
Description:: Collection of Ancient Egyptian texts, containing hieroglyphs, a transliteration and a translation.
Rights:: Not specified

845. STAZKA – Speech recordings from vehicles

Creator:: Šmídl, Luboš, Stanislav, Petr, and Radová, Vlasta
Publisher:: University of West Bohemia, Department of Cybernetics
Type:: audio and corpus
Subject:: speech corpus, noisy speech, voice activity detector, and speech recognition
Language:: Czech
Description:: The database contains two sets of recordings, both made in moving or stationary vehicles (passenger cars or trucks). All data were recorded within the project “Intelligent Electronic Record of the Operation and Vehicle Performance”, whose aim is to develop voice-operated software for registering vehicle operation data. The first set (full_noises.zip) consists of relatively long recordings from the vehicle cabin, containing spontaneous speech from the vehicle crew. The recordings are accompanied by detailed transcripts in the Transcriber XML-based format (.trs). Due to the recording settings, the audio contains many different noises, only sparsely interspersed with speech. As such, the set is suitable for robust estimation of voice activity detector parameters. The second set (prompts.zip) consists of short prompts recorded in a controlled setting: the speakers either answered simple questions or repeated commands and short phrases. The prompts were recorded by 26 different speakers. Each speaker recorded at least two sessions with an identical set of prompts: the first in a stationary vehicle with a low level of noise (these recordings are marked by –A_ in the file name) and the second while actually driving (marked by –B_ or, since several speakers recorded three sessions, by –C_). The recordings from this set are suitable mainly for training a robust domain-specific speech recognizer and for ASR testing.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
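
The session markers described for the STAZKA prompts set can be used to split recordings by noise condition. Below is a minimal sketch, assuming the marker appears in the file name as a hyphen or en dash followed by A, B, or C and an underscore. The file names, the function name session_condition, and the exact dash character are assumptions, since the description does not show a complete file name.

```python
import re
from typing import Optional

# Accept both an ASCII hyphen and an en dash before the session letter,
# because the catalog text may have typographically altered the marker.
_SESSION = re.compile(r"[-\u2013]([ABC])_")


def session_condition(file_name: str) -> Optional[str]:
    """Return 'stationary' for session A, 'driving' for B or C, else None."""
    match = _SESSION.search(file_name)
    if match is None:
        return None
    return "stationary" if match.group(1) == "A" else "driving"


# Hypothetical file names for illustration only.
examples = ["spk01-A_0001.wav", "spk01-B_0001.wav", "notes.txt"]
conditions = [session_condition(name) for name in examples]
```

A split like this would allow, for example, training on the low-noise sessions and evaluating on the driving sessions.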

846. Stuttgart Finite State Transducer Tools

Publisher:: University of Stuttgart
Type:: toolService
Description:: SFST is a finite state transducer toolkit for the implementation of morphologies and other applications of finite state transducers. SFST comprises a compiler and several tools for transforming, printing and applying transducers.
Rights:: Not specified

847. STYX

Creator:: Kučera, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService
Subject:: education, morphology, and syntax
Language:: Czech
Description:: The STYX system is an electronic exercise book for practising Czech morphology and syntax, consisting of more than 11,000 sentences.
Rights:: GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB

848. Subtitle Word Frequencies

Publisher:: Center for Reading Research, Ghent University
Type:: lexicalConceptualResource
Language:: Chinese, Dutch, English, German, Modern Greek (1453-), and Spanish
Rights:: Not specified

849. SumeCzech

Creator:: Straka, Milan, Mediankin, Nikita, Kocmi, Tom, Žabokrtský, Zdeněk, Hudeček, Vojtěch, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: summarization, SumeCzech, and Rouge
Language:: Czech
Description:: This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al. The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format. The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation. Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset, however, is still exactly the same. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
Rights:: Mozilla Public License 2.0, http://opensource.org/licenses/MPL-2.0, and PUB

850. SumeCzech-NER

Creator:: Marek, Petr and Müller, Štěpán
Publisher:: Czech Technical University in Prague
Type:: text and corpus
Subject:: SumeCzech, named entity recognition, named entity corpus, and summarization
Language:: Czech
Description:: SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).

Format: The dataset is split into four files in jsonl format, with one JSON object on each line. The most important fields of the JSON objects are:
- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of the article's abstract
- ne_headline: list of named entity annotations of the article's headline
- ne_text: list of named entity annotations of the article's text
- url: the article's URL, which can be used to match articles across SumeCzech and SumeCzech-NER

Annotations: We used SpaCy's NER model trained on CoNLL-based extended CNEC 2.0. The model achieved a 78.45 F-Score on the dataset's testing set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.

Tokenization: We used the following Python code for tokenization:

    from typing import List

    from nltk.tokenize import word_tokenize


    def tokenize(text: str) -> List[str]:
        # Pad selected punctuation with spaces so it is split off as separate tokens.
        for mark in ('.', ',', '?', '!', '-', '–', '/'):
            text = text.replace(mark, f' {mark} ')
        tokens = word_tokenize(text)
        return tokens
Rights:: Mozilla Public License 2.0, http://opensource.org/licenses/MPL-2.0, and PUB
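
The jsonl layout and IOB2 annotations described for SumeCzech-NER can be consumed with the standard library alone. The sketch below indexes records by their url field and collapses a sequence of IOB2 tags into entity spans. It assumes each ne_* field is a flat list of tag strings aligned with the tokens; the tag labels (such as B-P), the example URL, and the names load_by_url and iob2_spans are assumptions for illustration, not part of the dataset documentation.

```python
import json
from typing import Dict, Iterable, List, Optional, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def load_by_url(lines: Iterable[str]) -> Dict[str, dict]:
    """Parse jsonl lines and index the records by their 'url' field."""
    records: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            records[record["url"]] = record
    return records


def iob2_spans(tags: List[str]) -> List[Span]:
    """Collapse IOB2 tags into (type, start, end) spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if label is not None and not inside:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical record; the field values are placeholders.
sample_line = json.dumps({
    "dataset": "dev",
    "url": "http://example.com/article",
    "ne_headline": ["B-P", "I-P", "O", "B-G"],
})
records = load_by_url([sample_line])
headline_spans = iob2_spans(records["http://example.com/article"]["ne_headline"])
```

The same url key could then be used to look up the matching article in SumeCzech itself, as the description suggests.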
