Search Results
2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
3. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents
- Creator:
- Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- image and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- German, Czech, Latin, and English
- Description:
- This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials
- Creator:
- Novotný, Vít and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
5. SQAD
- Creator:
- Medveď, Marek and Horák, Aleš
- Publisher:
- Masaryk University, NLP Centre
- Type:
- text and corpus
- Subject:
- question answering, Simple Question Answering Database, and SQAD
- Language:
- Czech
- Description:
- The SQAD database consists of 3301 records obtained from Czech Wikipedia articles. Each record has the following structure: the original sentence(s) from Wikipedia; a question that is directly answered in the text; the expected answer to the question as it appears in the original text; the URL of the Wikipedia web page from which the original text was extracted; and the name of the author of this SQAD record.
- Rights:
- GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB
6. sqad 2.1
- Creator:
- Medveď, Marek, Horák, Aleš, and Kušniráková, Dáša
- Publisher:
- Natural Language Processing Centre, Faculty of Informatics, Masaryk University
- Type:
- text and corpus
- Subject:
- Czech, Simple Question Answering Database, and question answering
- Language:
- Czech
- Description:
- Simple question answering database version 2.1 (SQAD_v2.1) created from Czech Wikipedia. Each SQAD record consists of four files (in vertical form, provided with lemmatization and POS tagging) and two metadata files.
- Rights:
- GNU Library or "Lesser" General Public License 3.0 (LGPL-3.0), http://opensource.org/licenses/LGPL-3.0, and PUB
7. sqad 3.0
- Creator:
- Medveď, Marek and Horák, Aleš
- Publisher:
- Masaryk University, NLP Centre
- Type:
- text and corpus
- Subject:
- Simple Question Answering Database, Czech, and question answering
- Language:
- Czech
- Description:
- Simple question answering database version 3 (SQAD v3) created from Czech Wikipedia. The new version consists of 13477 records. Each SQAD record consists of multiple files: question, answer extraction, answer selection, URL, question metadata, and in some cases answer context.
- Rights:
- GNU Library or "Lesser" General Public License 3.0 (LGPL-3.0), http://opensource.org/licenses/LGPL-3.0, and PUB
8. SQAD v2
- Creator:
- Medveď, Marek, Horák, Aleš, and Šulganová, Terézia
- Publisher:
- Natural Language Processing Centre, Faculty of Informatics, Masaryk University
- Type:
- text and corpus
- Subject:
- question answering, Czech, and Simple Question Answering Database
- Language:
- Czech
- Description:
- Simple question answering database (SQAD) created from Czech Wikipedia. Each SQAD record consists of four files (in vertical form, provided with lemmatization and POS tagging) and two metadata files.
- Rights:
- GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB