Search Results
2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
3. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents
- Creator:
- Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- image and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- German, Czech, Latin, and English
- Description:
- This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials
- Creator:
- Novotný, Vít and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
5. Annotated Corpus of Czech Case Law for Reference Recognition Tasks
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, Kotková, Adéla, Loutocký, Pavel, Míšek, Jakub, Procházková, Daniela, Pullmannová, Helena, Semenišín, Petr, Šejnová, Tamara, Šimková, Nikola, Vosinek, Michal, Zavadilová, Lucie, and Zibner, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reference recognition and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Every decision was annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between the annotators. Adjudication was conducted non-destructively, so the dataset contains all original annotations. The corpus was developed as training and testing material for reference recognition tasks. The dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document, etc.) and values (polarity, depth of discussion, etc.).
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
6. Annotated Corpus of Czech Case Law for Reference Recognition Tasks (2019-06-25)
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, Kotková, Adéla, Loutocký, Pavel, Míšek, Jakub, Procházková, Daniela, Pullmannová, Helena, Semenišín, Petr, Šejnová, Tamara, Šimková, Nikola, Vosinek, Michal, Zavadilová, Lucie, and Zibner, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reference recognition and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Every decision was annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between the annotators. Adjudication was conducted non-destructively, so the corpus (raw) contains all original annotations. The corpus was developed as training and testing material for reference recognition tasks. The dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document, etc.) and values (polarity, depth of discussion, etc.).
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
7. Annotated Corpus of Czech Case Law for Segmentation Tasks
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, and Míšek, Jakub
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- document segmentation and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). 280 decisions were annotated by one trained annotator and then manually adjudicated by one trained curator. 70 decisions were annotated by two trained annotators and then manually adjudicated by one trained curator. Adjudication was conducted destructively, so the dataset contains only the correct annotations and does not contain all original annotations. The corpus was developed as training and testing material for text segmentation tasks. The dataset contains decisions segmented into Header, Procedural History, Submission/Rejoinder, Court Argumentation, Footer, Footnotes, and Dissenting Opinion. Segmentation makes it possible to treat different parts of a text differently even if they contain similar linguistic or other features.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
8. Czech and English Reflective Dataset (CEReD)
- Creator:
- Štefánik, Michal and Nehyba, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reflective writing, reflective categories, pre-service teachers, and hand annotation
- Language:
- English and Czech
- Description:
- The database contains annotated reflective sentences, which fall into the categories of reflective writing according to Ullmann's (2019) model. The dataset is ready for replicating the prediction of these categories using machine learning. Available from: https://anonymous.4open.science/repository/c856595c-dfc2-48d7-aa3d-0ccc2648c4dc/data
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
9. Parallel Global Voices, Czech-English NER+NEL
- Creator:
- Nevěřilová, Zuzana and Žižková, Hana
- Publisher:
- Masaryk University, Brno
- Type:
- text, other, and lexicalConceptualResource
- Subject:
- named entity recognition, named entities, named entity, named entity corpus, named entity linking, named entity disambiguation, and wikidata
- Language:
- English and Czech
- Description:
- Named entity annotations added to the existing Parallel Global Voices resource, ces-eng language pair. The named entity annotations distinguish four classes: Person, Organization, Location, Misc. The annotation is in the IOB schema (annotation per token, beginning + inside of the multi-word annotation). NEL annotation contains Wikidata Qnames.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
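
The IOB schema mentioned in the description of item 9 can be illustrated with a minimal sketch that groups per-token tags back into multi-word entities. The function name `iob_to_spans`, the sample sentence, and the tag strings (`B-PER`, `I-PER`, `B-LOC`, `O`) are assumptions made for illustration only; the actual file layout and label names of the dataset are not stated in this listing and may differ.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, label) pairs.

    Hypothetical helper: assumes tags of the form "B-X", "I-X", or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # "B-" always begins a new entity.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # "I-" continues the open entity with the same label.
            current_tokens.append(token)
        elif tag.startswith("I-"):
            # Stray "I-" without a matching open entity: treat it as a start.
            flush()
            current_tokens, current_label = [token], tag[2:]
        else:
            # "O" (outside) closes any open entity.
            flush()
    flush()
    return spans


# Invented example sentence, not taken from the dataset.
tokens = ["Václav", "Havel", "visited", "Prague", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, tags))
```

Treating a stray `I-` tag as the start of a new entity is one lenient design choice; a stricter reader could instead raise an error on malformed sequences.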