Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

11. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)

Creator:: Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: NER, named entity recognition, and Medieval
Language:: Czech, English, German, and Latin
Description:: This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

12. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents

Creator:: Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: image and corpus
Subject:: ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
Language:: German, Czech, Latin, and English
Description:: This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

13. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials

Creator:: Novotný, Vít and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
Language:: Czech, English, German, and Latin
Description:: These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

14. A Manifestation for Reinhard Heydrich at the ND

Creator:: Aktualita
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Heydrichiáda, tryzna Heydrich Reinhard, divadlo interiér, lóže divadelní, orlice říšskoněmecká, znak zemský Čechy, znak zemský Morava, busta Heydrich Reinhard, projevy veřejné, lidé tleskající, hajlování, lidé hajlující, manifestace divadelníků, Národní divadlo, Places::Praha::Nové Město::Národní divadlo /int./, People::Krejčí Jaroslav (1892-1956), People::Deyl Rudolf st. (1876-1972), People::Moravec Emanuel (1893-1945), People::Nasková Růžena (1884-1960), People::Höger Karel (1909-1977), People::Futurista Ferenc (1891-1947), People::Neumann Stanislav (1902-1975), People::Nový Oldřich (1899-1983), People::Šejbalová Jiřina (1905-1981), People::Baldová Zdenka (1885-1958), People::Průcha Jaroslav (1898-1963), and Český zvukový týdeník Aktualita::1942/27A
Language:: Czech
Description:: Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) 1942, issue no. 27A, captures the Pledge of Czech Theatre Professionals' Allegiance to the Reich, a manifestation held at the National Theatre in Prague on 25 June 1942, which was to unequivocally condemn the assassination of Acting Reich Protector Reinhard Heydrich. Speeches are delivered by actor Rudolf Deyl Jr. and Minister of Education and People's Enlightenment Emanuel Moravec (silent). Actress Růžena Nasková and actors Karel Höger, Ferenc Futurista, and Stanislav Neumann are seen among the participants. The segment concludes with everyone performing the Nazi salute.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

15. A morphological layer for the German part of the SMULTRON corpus

Creator:: Müller, Thomas, Schütze, Hinrich, Caratti, Francesca, and Recknagel, Arne
Publisher:: Center for Information and Language Processing, University of Munich
Type:: text and corpus
Subject:: morphology, morphological tagging, and PoS tagging
Language:: German
Description:: A morphological layer for the German part of the SMULTRON corpus [0]. The layer was annotated according to the STTS tagset and the annotation guidelines of the Tiger corpus. Coordinator: Thomas Müller. Annotators: Francesca Caratti, Arne Recknagel. The annotation process is described in: Müller, Thomas and Schütze, Hinrich (2015). Robust Morphological Tagging with Word Representations. In Proceedings of NAACL. [0] http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks/smultron_en.html
Rights:: Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB

16. A simplified front-end for SemTi-Kamols morphological analyser

Publisher:: Institute of Mathematics and Computer Science, University of Latvia
Type:: toolService
Subject:: morphological analyzer
Language:: Latvian
Description:: A simplified front-end (in the form of a RESTful web service) for the SemTi-Kamols morphological analyzer, intended mainly for demonstration purposes.
Rights:: Not specified

17. A Small Dataset for English-to-Czech Speech Translation in the Travel Domain

Creator:: Cífka, Ondřej and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: speech corpus, ASR, and machine translation
Language:: English and Czech
Description:: This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate). The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied. The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

18. A Speech Test Set of Practice Business Presentations with Additional Relevant Texts

Creator:: Macháček, Dominik, Kratochvíl, Jonáš, Vojtěchová, Tereza, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: ASR, ASR evaluation, speech corpus, non-native English, speech recognition, speech recognition evaluation, speech and relevant texts, and European non-native English
Language:: English
Description:: We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web pages. The corpus is intended for evaluation of automatic speech recognition (ASR) systems, especially in conditions where the prior availability of in-domain vocabulary and named entities is beneficial. The corpus consists of 39 presentations in English, each up to 90 seconds long, and slides and web pages in Czech, Slovak, English, German, Romanian, Italian or Spanish. The speakers are high school students from European countries with English as their second language. We benchmark three baseline ASR systems on the corpus and show where they fall short.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

19. A.C. Nor (writer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Nor A.C. (1903-1986)
Language:: No linguistic content
Description:: Writer A. C. Nor with an unidentified man on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

20. ABC - Language Identifier

Publisher:: Research Institute for Artificial Intelligence, Romanian Academy of Sciences
Type:: toolService
Description:: The application, developed in C#, automatically identifies the language of a text written in one of the 21 European Union languages. Using training texts in different languages (approx. 1.5Mb of text for each language), a training module counts the prefixes (the first 3 characters) and the suffixes (the last 4 characters) of all the words in the texts, for each language. For every language, two models are constructed, containing the weights (percentages) of prefixes and suffixes in the texts representing that language. In the prediction phase, two models are built on the fly for the new text in the same manner. These models are then compared with the stored models representing each language for which the application was trained. Using comparison functions, the best model is chosen. More detailed descriptions are available in the following papers (http://www.racai.ro/~tufis/papers): -- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2008). RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference - LREC 2008, Marrakech, Morocco, May 2008. ELRA - European Language Resources Association. ISBN 2-9517408-4-0. -- Dan Tufiş and Alexandru Ceauşu (2007). Diacritics Restoration in Romanian Texts. In Elena Paskaleva and Milena Slavcheva (eds.), A Common Natural Language Processing Paradigm for Balkan Languages - RANLP 2007 Workshop Proceedings, pp. 49-56, Borovets, Bulgaria, September 2007. INCOMA Ltd., Shoumen, Bulgaria. ISBN 978-954-91743-8-0. -- Dan Tufiş and Adrian Chiţu (1999). Automatic Insertion of Diacritics in Romanian Texts. In Ferenc Kiefer, Gábor Kiss, and Júlia Pajzs (eds.), Proceedings of the 5th International Workshop on Computational Lexicography (COMPLEX 1999), pp. 185-194, Pecs, Hungary, May 1999. Linguistics Institute, Hungarian Academy of Sciences.
Rights:: Not specified
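
The prefix/suffix approach described for ABC - Language Identifier can be illustrated with a minimal Python sketch. It is not the tool's C# implementation: the function names (`affix_model`, `overlap`, `train`, `identify`), the tiny training samples, and above all the comparison function are assumptions made for illustration. The record only says that "comparison functions" are used, so the sketch substitutes a simple overlap score (the sum of the smaller weight for each shared affix).

```python
import re
from collections import Counter

def affix_model(text, kind):
    """Relative frequencies of 3-character prefixes or 4-character suffixes."""
    # Words are maximal runs of letters (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if kind == "prefix":
        affixes = [w[:3] for w in words if len(w) >= 3]
    else:
        affixes = [w[-4:] for w in words if len(w) >= 4]
    counts = Counter(affixes)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()} if total else {}

def overlap(model_a, model_b):
    """Assumed comparison function: sum of the smaller weight per shared affix."""
    return sum(min(w, model_b.get(a, 0.0)) for a, w in model_a.items())

def train(samples):
    """Build one prefix model and one suffix model per language."""
    return {
        lang: (affix_model(text, "prefix"), affix_model(text, "suffix"))
        for lang, text in samples.items()
    }

def identify(text, models):
    """Build models for the new text and pick the closest stored language."""
    prefixes = affix_model(text, "prefix")
    suffixes = affix_model(text, "suffix")
    return max(
        models,
        key=lambda lang: overlap(prefixes, models[lang][0])
        + overlap(suffixes, models[lang][1]),
    )

# Toy training data (hypothetical; the real tool uses about 1.5Mb per language).
samples = {
    "en": "the weather is nice and the children are playing in the garden behind the house",
    "de": "das wetter ist schön und die kinder spielen im garten hinter dem haus",
}
models = train(samples)
print(identify("the children are in the house", models))
print(identify("die kinder spielen im garten", models))
```

With samples this small the models are only a toy; the point is the two-model structure per language and the on-the-fly models built for the text being classified.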

