Search Results
2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
3. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents
- Creator:
- Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- image and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- German, Czech, Latin, and English
- Description:
- This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials
- Creator:
- Novotný, Vít and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
5. Annotated Corpus of Czech Case Law for Reference Recognition Tasks
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, Kotková, Adéla, Loutocký, Pavel, Míšek, Jakub, Procházková, Daniela, Pullmannová, Helena, Semenišín, Petr, Šejnová, Tamara, Šimková, Nikola, Vosinek, Michal, Zavadilová, Lucie, and Zibner, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reference recognition and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Every decision was annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between the annotators. Adjudication was conducted non-destructively, so the dataset contains all original annotations. The corpus was developed as training and testing material for reference recognition tasks. The dataset contains references to other court decisions and to literature. All references consist of basic units (identifier of the court decision, identification of the court issuing the referred decision, author of the book or article, title of the book or article, point of interest in the referred document, etc.) and values (polarity, depth of discussion, etc.).
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
6. Annotated Corpus of Czech Case Law for Reference Recognition Tasks (2019-06-25)
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, Kotková, Adéla, Loutocký, Pavel, Míšek, Jakub, Procházková, Daniela, Pullmannová, Helena, Semenišín, Petr, Šejnová, Tamara, Šimková, Nikola, Vosinek, Michal, Zavadilová, Lucie, and Zibner, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reference recognition and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Every decision was annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between the annotators. Adjudication was conducted non-destructively, so the corpus (raw) contains all original annotations. The corpus was developed as training and testing material for reference recognition tasks. The dataset contains references to other court decisions and to literature. All references consist of basic units (identifier of the court decision, identification of the court issuing the referred decision, author of the book or article, title of the book or article, point of interest in the referred document, etc.) and values (polarity, depth of discussion, etc.).
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
7. Annotated Corpus of Czech Case Law for Segmentation Tasks
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, and Míšek, Jakub
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- document segmentation and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). 280 decisions were annotated by one trained annotator and then manually adjudicated by one trained curator. 70 decisions were annotated by two trained annotators and then manually adjudicated by one trained curator. Adjudication was conducted destructively, so the dataset contains only the correct annotations and not all original annotations. The corpus was developed as training and testing material for text segmentation tasks. The dataset contains decisions segmented into Header, Procedural History, Submission/Rejoinder, Court Argumentation, Footer, Footnotes, and Dissenting Opinion. Segmentation allows different parts of a text to be treated differently even if they contain similar linguistic or other features.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
8. Czech and English Reflective Dataset (CEReD)
- Creator:
- Štefánik, Michal and Nehyba, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reflective writing, reflective categories, pre-service teachers, and hand annotation
- Language:
- English and Czech
- Description:
- The database contains annotated reflective sentences, which fall into the categories of reflective writing according to Ullmann's (2019) model. The dataset is ready for replicating the prediction of these categories using machine learning. Available from: https://anonymous.4open.science/repository/c856595c-dfc2-48d7-aa3d-0ccc2648c4dc/data
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
9. Digital humanities: Introduction. A 10-week course with practical sessions.
- Creator:
- Nevěřilová, Zuzana
- Publisher:
- Masaryk University, Brno
- Type:
- VIDEO and onlineCourse
- Subject:
- data-driven research, digital content processing, text processing, image processing, metadata, word embeddings, evaluation, and research infrastructures
- Language:
- English
- Description:
- The aim of the course is to introduce digital humanities and to describe various aspects of digital content processing. The course consists of 10 lessons, each with video material and a PowerPoint presentation covering the same content. Every lesson contains a practical session: either a Jupyter Notebook to work with in Python or a text file with a short description of the task. Most of the practical tasks consist of running the programme and analysing the results. Although the course does not focus on programming, the code can be reused easily in individual projects. Some experience in running Python code is desirable but not required.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
10. Kacenka : parallel corpus of English and Czech texts
- Publisher:
- Masaryk University, Brno
- Type:
- corpus
- Language:
- Czech and English
- Description:
- Parallel corpus, 3,297,283 words. The idea was to create a small parallel corpus that would make it possible to work with entire texts in translation analysis rather than short extracts. At the same time, it aimed at acquiring experience that could be used in creating a larger parallel corpus of English and Czech in the future. Although the main part of the work has been completed, and the aims of the KACENKA grant met, we keep improving and enlarging KACENKA gradually. Currently, it has a size of 3,297,283 words (of which 1,689,513 have been acquired by means of scanning). Most of the English texts for KACENKA have been retrieved from Internet resources. The rest, and nearly all the Czech texts, had to be scanned with the use of an OCR programme. KACENKA is stored on a single CD-ROM; its use is limited by copyright restrictions.
- Rights:
- Not specified