Web-based information system on the scientific community (news, events, persons, job market, mailing list, database of research projects and corpora, bibliography, glossary and links) and on recording equipment and software. Disciplinary scope: research on conversation and discourse analysis and spoken language.
Glossa is a web-based system for corpus search and results management. It comes with built-in support for CLARIN federated content search as well as for corpora encoded with the IMS Corpus Workbench. It also has a plugin architecture that enables other search engines to be used once a wrapper has been created. Glossa can be freely downloaded and installed on the user's server. It currently supports only monolingual written corpora, but support for multilingual corpora is under development, as is support for spoken corpora with audio, video and maps.
Annotated list of dependency bigrams occurring in the PDT more than five times and having part-of-speech patterns that can possibly form a collocation. Each bigram is assigned to one of the six MWE categories by three annotators.
The GrandStaff-LMX dataset is based on the GrandStaff dataset described in the "End-to-end optical music recognition for pianoform sheet music" paper by Antonio Ríos-Vila et al., 2023, https://doi.org/10.1007/s10032-023-00432-z .
The GrandStaff-LMX dataset contains MusicXML and Linearized MusicXML encodings of all systems from the original dataset, suitable for evaluation with the TEDn metric. It also contains the official GrandStaff train/dev/test split.
70K words. Non-validated sentence segmentation, non-validated POS tagging, manual annotation of syntactic dependencies and dependency labels, manual annotation of semantic roles, and manual annotation of events based on a shallow domain-specific ontology (only for a 31K-word subset of GDT).
A dataset of handwritten Czech text lines sourced from two chronicles (municipal chronicles 1931-1944, school chronicles 1913-1933).
The dataset comprises 25k lines machine-extracted from scanned pages and provides manual annotation of the text content for a subset of 2k lines.
HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that has recently become popular. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes.