Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

1151. Julius Skřivan (impresario)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností and People::Skřivan Julius (1860-1923)
Language:: No linguistic content
Description:: Footage of impresario Julius Skřivan.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

1152. Julius Stoklasa (agrobiologist)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Stoklasa Julius (1857-1936)
Language:: No linguistic content
Description:: Professor and agrobiologist Julius Stoklasa in the Botanical Garden. Stoklasa with his colleagues by a greenhouse.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

1153. jusText

Creator:: Pomikálek, Jan
Publisher:: Masaryk University, NLP Centre
Type:: toolService and tool
Subject:: boilerplate, web documents, text cleaning, boilerplate removal, and text corpora
Language:: English
Description:: jusText is a heuristic-based boilerplate removal tool for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/justext/). It is used successfully for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis, "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm removes most non-grammatical sentences from a web page, such as navigation, advertisements, tables, and short notes. According to a comparison with participants of the Cleaneval competition in the author's Ph.D. thesis, it outperforms or at least keeps up with its competitors. The precision of unwanted-content removal and the scalability of the algorithm were demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether. and PRESEMT, Lexical Computing Ltd
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

1154. K. M. Walló (poet, screenwriter)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Walló K. M. (1914-1990)
Language:: No linguistic content
Description:: Poet and screenwriter K. M. Walló on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

1155. Kacenka : parallel corpus of English and Czech texts

Publisher:: Masaryk University, Brno
Type:: corpus
Language:: Czech and English
Description:: Parallel corpus, 3,297,283 words. The idea was to create a small parallel corpus that would make it possible to work with entire texts in translation analysis rather than short extracts. It was also intended to provide experience that could be used in creating a larger parallel corpus of English and Czech in the future. Although the main part of the work has been completed, and the aims of the KACENKA grant met, we keep improving and enlarging KACENKA gradually. It currently contains 3,297,283 words (of which 1,689,513 were acquired by scanning). Most of the English texts for KACENKA were retrieved from Internet resources. The rest, and nearly all the Czech texts, had to be scanned using an OCR programme. KACENKA is stored on a single CD-ROM; its use is limited by copyright restrictions.
Rights:: Not specified

1156. KAL Corpus

Publisher:: Department of Linguistics and Nordic Studies, University of Oslo
Type:: corpus
Language:: Norwegian
Description:: 3300 texts written by pupils for the final examination in Norwegian in 1998, 1999, 2000 and 2001. The database also includes the associated grades and other background material.
Rights:: Not specified

1157. Kali-Korpus

Publisher:: Leibniz Universität Hannover
Type:: corpus
Subject:: Germanistik
Language:: German
Description:: Diachronic corpus with a focus on the annotation and lemmatization of verbal categories.
Rights:: Not specified

1158. Kamil Hilbert (architect)

1159. Kamil Lhoták (painter)

1160. Kamila Ungrová (opera singer)