1 - 8 of 8
Number of results to display per page
Search Results
2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
3. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents
- Creator:
- Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- image and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- German, Czech, Latin, and English
- Description:
- This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials
- Creator:
- Novotný, Vít and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
5. Czech and English Reflective Dataset (CEReD)
- Creator:
- Štefánik, Michal and Nehyba, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reflective writing, reflective categories, pre-service teachers, and hand annotation
- Language:
- English and Czech
- Description:
- The database contains annotated reflective sentences, which fall into the categories of reflective writing according to Ullmann's (2019) model. The dataset is ready to replicate these categories' prediction using machine learning. Available from: https://anonymous.4open.science/repository/c856595c-dfc2-48d7-aa3d-0ccc2648c4dc/data
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
6. Digital humanities: Introduction. A 10-week course with practical sessions.
- Creator:
- Nevěřilová, Zuzana
- Publisher:
- Masaryk University, Brno
- Type:
- VIDEO and onlineCourse
- Subject:
- data-driven research, digital content processing, text processing, image processing, metadata, word embeddings, evaluation, and research infrastructures
- Language:
- English
- Description:
- The aim of the course is to introduce digital humanities and to describe various aspects of digital content processing. The course consists of 10 lessons with video material and a PowerPoint presentation with the same content. Every lesson contains a practical session – either a Jupyter Notebook to work in Python or a text file with a short description of the task. Most of the practical tasks consist of running the programme and analyse the results. Although the course does not focus on programming, the code can be reused easily in individual projects. Some experience in running Python code is desirable but not required.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
7. Kacenka : parallel corpus of English and Czech texts
- Publisher:
- Masaryk University, Brno
- Type:
- corpus
- Language:
- Czech and English
- Description:
- Parallel corpus, 3,297,283 words. The idea was to create a small parallel corpus which would enable to work with entire texts in translation analysis rather then short extracts. At the same time it aimed at acquiring experience that could be used in creating a larger parallel corpus of English and Czech in the future. Although the main part of work has been completed -- and the aims of the KACENKA grant met -- we keep improving and enlarging KACENKA gradually. Currently, it has the size of 3,297,283 words (out of which, 1,689,513 have been acquired by means of scanning). Most of the English texts for KACENKA have been retrieved from the Internet resources. The rest -- and nearly all the Czech texts -- had to be scanned with the use of an OCR programme. KACENKA is stored on a single CD-ROM; its use is limited by copyright restrictions.
- Rights:
- Not specified
8. Parallel Global Voices, Czech-English NER+NEL
- Creator:
- Nevěřilová, Zuzana and Žižková, Hana
- Publisher:
- Masaryk University, Brno
- Type:
- text, other, and lexicalConceptualResource
- Subject:
- named entity recognition, named entities, named entity, named entitity corpus, named entity linking, named entity disambiguation, and wikidata
- Language:
- English and Czech
- Description:
- Annotation of named entities to the existing source Parallel Global Voices, ces-eng language pair. The named entity annotations distinguish four classes: Person, Organization, Location, Misc. The annotation is in the IOB schema (annotation per token, beginning + inside of the multi-word annotation). NEL annotation contains Wikidata Qnames.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB