LINDAT / CLARIAH-CZ Data & Tools
Data and Tools from partner institutions of LINDAT/CLARIAH-CZ project, formerly LINDAT/CLARIN.
http://hdl.handle.net/11858/00-097C-0000-0001-4877-A
2024-02-23T22:40:47Z

GrandStaff-LMX: Linearized MusicXML Encoding of the GrandStaff Dataset
http://hdl.handle.net/11234/1-5423
2024-02-19T08:45:20Z; 2024-02-12T00:00:00Z
Mayer, Jiří; Straka, Milan; Hajič jr., Jan; Pecina, Pavel
The GrandStaff-LMX dataset is based on the GrandStaff dataset described in the "End-to-end optical music recognition for pianoform sheet music" paper by Antonio Ríos-Vila et al., 2023, https://doi.org/10.1007/s10032-023-00432-z .
The GrandStaff-LMX dataset contains MusicXML and Linearized MusicXML encodings of all systems from the original dataset, suitable for evaluation with the TEDn metric. It also contains the official GrandStaff train/dev/test split.

OLiMPiC 1.0: OpenScore Lieder Linearized MusicXML Piano Corpus
http://hdl.handle.net/11234/1-5419
2024-02-19T08:44:08Z; 2024-02-12T00:00:00Z
Mayer, Jiří; Straka, Milan; Hajič jr., Jan; Pecina, Pavel
OLiMPiC: OpenScore Lieder Linearized MusicXML Piano Corpus is a dataset containing synthetic and scanned images of pianoform music scores. The scores and the scanned images originate from the OpenScore Lieder Corpus https://github.com/OpenScore/Lieder .
OLiMPiC contains the scores in MusicXML and Linearized MusicXML encoding, suitable for evaluation with the TEDn metric. The official train/dev/test split is also provided.

AlbNews Albanian Topic Modeling
http://hdl.handle.net/11234/1-5411
2024-02-19T08:40:46Z; 2024-02-07T00:00:00Z
Çano, Erion
AlbNews is a topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample consists of a headline text retrieved from Albanian online news portals and one of four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. Each unlabeled sample contains a headline text only.
The AlbTopic corpus is released under the CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper:
Çano Erion, Lamaj Dario. AlbNews: A Corpus of Headlines for Topic Modeling in Albanian. CoRR, abs/2402.04028, 2024. URL: https://arxiv.org/abs/2402.04028.
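The four label codes listed in the AlbNews description can be captured in a small lookup table. The following is only an illustrative sketch: the names `LABELS` and `topic_name` are hypothetical, and the convention that an unlabeled sample carries an empty or missing label is an assumption, since the record does not describe the file layout.

```python
# Label codes and their meanings, as listed in the corpus description.
LABELS = {
    "pol": "politics",
    "cul": "culture",
    "eco": "economy",
    "spo": "sport",
}


def topic_name(code):
    """Return the topic name for a label code.

    Assumption: unlabeled samples are represented by an empty string
    or None; the actual file format is not specified in the record.
    """
    if not code:
        return None
    return LABELS[code]
```

For example, `topic_name("eco")` yields `"economy"`, while an empty label yields `None`.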

ESIC 1.1 -- Europarl Simultaneous Interpreting Corpus (2024-02-05)
http://hdl.handle.net/11234/1-5415
2024-02-06T15:56:52Z; 2024-02-05T00:00:00Z
Macháček, Dominik; Žilinec, Matúš; Bojar, Ondřej
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains the source English video and audio recordings. The interpreters' voices are not published within the corpus, but a tool is provided that downloads them from the European Parliament website, where they are publicly available.
The transcripts are equipped with metadata (disfluencies, mixing voices and languages, read or spontaneous speech, etc.), punctuated, and with word-level timestamps.
The speeches in the corpus come from the European Parliament plenary sessions held in 2008-2011. Most of the speakers are MEPs, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speeches (date, topic, read or spontaneous).
ESIC has validation and evaluation parts.
The current version is ESIC v1.1, which extends v1.0 with manual sentence alignment of the tri-parallel texts and with bi-parallel sentence alignment of the English original transcripts and the German interpreting.

Diakorp v6: diachronic corpus of Czech
http://hdl.handle.net/11234/1-5413
2024-02-01T21:14:24Z; 2015-12-18T00:00:00Z
Kučera, Karel; Řehořková, Anna; Stluka, Martin
Diakorp v6 is a diachronic corpus of Czech comprising 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th to the 20th century. The texts are transcribed, not transliterated. Diakorp v6 is provided in a CoNLL-U-like vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available to registered CNC users via the KonText query interface at http://www.korpus.cz

ParCzech 4.0
http://hdl.handle.net/11234/1-5360
2024-02-01T21:10:38Z; 2024-01-31T00:00:00Z
Kopp, Matyáš
The ParCzech 4.0 corpus consists of stenographic protocols that record the Chamber of Deputies' meetings in the 7th term (2013-2017), the 8th term (2017-2021) and the current 9th term (2021-Jul 2023). The protocols are provided in their original HTML format and in the Parla-CLARIN TEI format. The corpus is automatically enriched with morphological, syntactic, and named-entity annotations using UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files.
The audio files for this corpus are available in the AudioPSP 24.01 corpus (http://hdl.handle.net/11234/1-5404).
This corpus covers the same period as the ParlaMint-CZ corpus v4.0 (http://hdl.handle.net/11356/1860). The ParCzech corpus follows and extends the ParlaMint schema. Both the annotated and the non-annotated versions include hypertext references to voting and parliamentary prints. Beyond what ParlaMint recommends, the annotated version contains source audio alignment, PDT xtag, and a more detailed CNEC2.0 named entity categorization.

AudioPSP 24.01: Audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic
http://hdl.handle.net/11234/1-5404
2024-02-19T10:42:44Z; 2024-01-01T00:00:00Z
Kopp, Matyáš
This record contains audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic. The recordings have been provided by the official websites of the Chamber of Deputies, and the set contains them in their original format with no further processing.
Recordings cover all available audio files from 2013-11-25 to 2023-07-26. Audio files are packed by year (2013-2023) and quarter (Q1-Q4) in tar archives audioPSP-YYYY-QN.tar.
Furthermore, there are two TSV files: audioPSP-meta.quarterArchive.tsv contains metadata about archives, and audioPSP-meta.audioFile.tsv contains metadata about individual audio files.
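The archive naming scheme described above can be illustrated with a short sketch that enumerates the archive names one would expect for the stated date range. It assumes that quarters are calendar quarters, that `YYYY` is a four-digit year and `N` a single quarter digit, and that an archive exists for every quarter in the range; the function names are hypothetical, and audioPSP-meta.quarterArchive.tsv remains the authoritative list of archives.

```python
from datetime import date


def quarter_of(d: date) -> int:
    """Calendar quarter (1-4) of a date; an assumed reading of Q1-Q4."""
    return (d.month - 1) // 3 + 1


def archive_names(first: date, last: date) -> list[str]:
    """List audioPSP-YYYY-QN.tar names for all quarters from first to last."""
    names = []
    year, quarter = first.year, quarter_of(first)
    end = (last.year, quarter_of(last))
    while (year, quarter) <= end:
        names.append(f"audioPSP-{year}-Q{quarter}.tar")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return names


# Date range taken from the record description.
names = archive_names(date(2013, 11, 25), date(2023, 7, 26))
```

Under these assumptions the list starts with `audioPSP-2013-Q4.tar` and ends with `audioPSP-2023-Q3.tar`.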

KUK 0.0
http://hdl.handle.net/11234/1-5363
2024-01-31T14:08:37Z; 2023-12-31T00:00:00Z
Hladká, Barbora; Cinková, Silvie; Kuk, Michal; Mírovský, Jiří; Novotná, Tereza; Zahálková, Kristýna Nguyen
KUK 0.0 is a pilot version of a corpus of Czech legal and administrative texts intended as data for manual and automatic assessment of the accessibility (comprehensibility or clarity) of Czech legal texts.

CorPipe 23 multilingual CorefUD 1.1 model (corpipe23-corefud1.1-231206)
http://hdl.handle.net/11234/1-5369
2024-01-07T11:29:21Z; 2023-12-06T00:00:00Z
Straka, Milan
The `corpipe23-corefud1.1-231206` is a `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released under the CC BY-NC-SA 4.0 license.
The model is language agnostic (no _corpus id_ on input), so it can be used to predict coreference in any `mT5` language (for zero-shot evaluation, see the paper). However, note that empty nodes must already be present in the input; they are not predicted (the same setting as in the CRAC23 shared task).

Corpus of precisely articulated Czech speech
http://hdl.handle.net/11234/1-5331
2023-12-21T10:41:03Z; 2023-12-15T00:00:00Z
Hanzlíček, Zdeněk; Kochová, Pavla; Tihelka, Daniel; Kövérová, Markéta; Matoušek, Jindřich; Ševeček, Pavel
The corpus contains speech data from two native Czech speakers, one male and one female. The speech is very precisely articulated, up to hyper-articulated, and the speech rate is low. Speech data with such emphasized articulation is suitable for teaching Czech to foreigners, and it can also be used for people with hearing or speech impairment. The recorded sentences can be used either directly, e.g., as a part of educational material, or as source data for building complex educational systems incorporating speech synthesis technology. All recorded sentences were precisely orthographically annotated and phonetically segmented, i.e., split into phones, using modern neural network-based methods.