Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

218. Bonner Frühneuhochdeutschkorpus (FnhdC)

Publisher:: Korpora.org and Fakultät Geisteswissenschaften, Universität Duisburg-Essen
Type:: corpus
Subject:: Germanistik
Language:: German
Description:: Digital, morphologically annotated (N, V, A) part of the Bonn Corpus of Early New High German; used to create the Grammatik des Frühneuhochdeutschen (III. Nouns; IV. Verbs; VI. Adjectives); morphologisch annotiert; Materialgrundlage für die Erarbeitung der Bände 3, 4 und 6 der "Grammatik des Frühneuhochdeutschen"
Rights:: Not specified

219. Bořek Rujan (opera singer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Rujan Bořek (1899-1955)
Language:: No linguistic content
Description:: Opera singer Bořek Rujan on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

220. Boris Milec (dancer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Milec Boris (1906-1984)
Language:: No linguistic content
Description:: Dancer Boris Milec on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

221. Bosworth-Toller’s Anglo-Saxon Dictionary online

Creator:: Tichý, Ondřej, Roček, Martin, Bočková, Renata, Čermák, Matěj, Dragounová, Jolana, Filipová, Helena, Gilová, Lucie, Hejná, Michaela, Hladíková, Lenka, Hladká, Alena, Hubinová, Veronika, Krajcsovicsová, Vlaďena, Kupková, Tatiana, Lebedeva, Tatiana, Malečková, Nikola, Novotná, Alena, Pazderová, Tereza, Popelíková, Jiřina, Rumlová, Jana, Tyčová Ocelík, Dana, Volná, Veronika, and Zahradníková, Tereza
Publisher:: Charles University, Faculty of Arts, Department of English Language and ELT Methodology
Type:: text, lexicon, and lexicalConceptualResource
Subject:: English, Old English, Anglo-Saxon, dictionary, Bosworth, Toller, lexicography, digitalization, English history, Mediaeval, and Medieval
Language:: English, Old English (ca. 450-1100), Latin, Ancient Greek (to 1453), and Ancient Hebrew
Description:: Description : This is an online edition of An Anglo-Saxon Dictionary, or a dictionary of "Old English". The dictionary records the state of the English language as it was used between ca. 700-1100 AD by the Anglo-Saxon inhabitants of the British Isles. This project is based on a digital edition of An Anglo-Saxon dictionary, based on the manuscript collections of the late Joseph Bosworth (the so called Main Volume, first edition 1898) and its Supplement (first edition 1921), edited by Joseph Bosworth and T. Northcote Toller, today the largest complete dictionary of Old English (one day to be hopefully supplanted by the DOE). Alistair Campbell's "enlarged addenda and corrigenda" from 1972 are not public domain and are therefore not part of the online dictionary. Please see the front & back matter of the paper dictionary for further information, prefaces and lists of references & contractions. The digitization project was initiated by Sean Crist in 2001 as a part of his Germanic Lexicon Project and many individuals and institutions have contributed to this project. Check out the original GLP webpage and the old Bosworth-Toller offline application webpage (to be updated). Currently the project is hosted by the Faculty of Arts, Charles University. In 2010, the data from the GLP were converted to create the current site. Care was taken to preserve the typography of the original dictionary, but also provide a modern, user friendly interface for contemporary users. In 2013, the entries were structurally re-tagged and the original typography was abandoned, though the immediate access to the scans of the paper dictionary was preserved. Our aim is to reach beyond a simple digital edition and create an online environment dedicated to all interested in Old English and Anglo-Saxon culture. Feel free to join in the editing of the Dictionary, commenting on its numerous entries or participating in the discussions at our forums. We hope that by drawing the attention of the community of Anglo-Saxonists to our site and joining our resources, we may create a more useful tool for everybody. The most immediate project to draw on the corrected and tagged data of the Dictionary is a Morphological Analyzer of Old English (currently under development). We are grateful for the generous support of the Charles University Grant Agency and for the free hosting at the Faculty of Arts at Charles University. The site is currently maintained and developed by Ondrej Tichy et al. at the Department of English Language and ELT Methodology, Faculty of Arts, Charles University in Prague (Czech Republic).
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

222. Botanicus Digital Library

Type:: corpus
Subject:: Germanistik
Language:: Chinese, Czech, English, French, German, Latin, and Spanish
Description:: Digital copies of historical botanic papers from the Missouri Botanical Garden Library; Bilddigitalisate von historischen botanischen Schriften; deutschsprachige Texte stellen nur einen Teilbereich dar
Rights:: Not specified

223. Boža Vronský (opera and operetta singer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Vronský Boža (1889-1955)
Language:: No linguistic content
Description:: Opera and operetta singer Boža Vronský on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

224. Božena Benešová (writer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: stroj psací, Galerie osobností, and People::Benešová Božena (1873-1936)
Language:: No linguistic content
Description:: Writer Božena Benešová at her typewriter in her study.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

225. Božena Petanová (opera singer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Petanová Božena (1888-1958)
Language:: No linguistic content
Description:: Opera singer Božena Petanová-Setunská with an unidentified man.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

226. Břetislav Maria Klika (editor)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Klika Břetislav Maria (1884-1958)
Language:: No linguistic content
Description:: Editor Břetislav Maria Klika on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

227. Břetislav Pračka (collector of literature)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: kniha Bezruč Petr Slezské písně, Galerie osobností, and People::Pračka Břetislav (1881-1958)
Language:: No linguistic content
Description:: Břetislav Pračka, a collector of literature by and about the poet Petr Bezruč, with an unidentified woman on Bohumil Veselý's balcony. They're examining different editions of Slezské písně (Silesian Songs).
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

229. British National Corpus

Type:: corpus
Language:: English
Description:: General reference corpus; 100 million words; POS, lemma, descriptive metadata
Rights:: Not specified

230. Brockhaus' Kleines Konversations-Lexikon

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: 5. Aufl. 1911; Fokus auf Politik, Wirtschaft, Kultur und Technik zu Beginn des 20. Jahrhunderts
Rights:: Not specified

231. Broken plural list

Creator:: Ouamer, meriem, Bouzoubaa, Karim, and Tajmout, rachida
Publisher:: ALELM research group
Type:: text, wordList, and lexicalConceptualResource
Subject:: Broken plural
Language:: Arabic
Description:: An LMF conformant XML-based file containing a comprehensive Arabic broken plural list. The file contains 12,249 singular words with their corresponding BPs
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

232. Bronislav Chorovič (opera singer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Chorovič Bronislav (1888-1980)
Language:: No linguistic content
Description:: Opera singer Bronislav Chorovič on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

233. Budapest Sociolinguistic Interview (BSI)

Publisher:: Academy of Sciences
Type:: corpus
Language:: Hungarian
Description:: BSI is a large-scale survey which provides reliable data on and analyses of the varieties of Hungarian spoken in Budapest.
Rights:: Not specified

234. Bulgarian CLEF Corpus

Type:: corpus
Language:: Bulgarian
Description:: Written, synchronic, general (newspapers)
Rights:: Not specified

235. Bulgarian-Croatian Comparable Corpus

Type:: corpus
Language:: Bulgarian and Croatian
Description:: written; domain-specific (newspaper); diachronic; bilingual; comparable; ca 3,500,000 tokens (393 Kw Bulgarian; 3.1 Mw Croatian)
Rights:: Not specified

236. BulTreeBank

Type:: corpus
Language:: Bulgarian
Description:: HPSG-based annotation including: constituent structure, dependency relations, named entities (classified as person, organisation, location or other names), coreferential relations. Annotation in XML
Rights:: Not specified

237. BulTreeBank Frequency List

Type:: lexicalConceptualResource
Language:: Bulgarian
Description:: 100 000 most frequent Cyrillic tokens in the BulTreeBank text archive, UTF-16 list of token-frequency pairs
Rights:: Not specified

238. BulTreeBank Morphological Analyzer

Creator:: Simov, Kiril and Osenova, Petya
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: It is used morphological lexicon of Bulgarian (100 000 lemmas) compiled as a finite-state automaton in CLaRK System. It requires the text to be first tokenized and it is applied in each token. Includes also guessers for unknown words and Named Entities gazetteers. If the corresponding resources are available for a different language, then it can be tuned to it.
Rights:: Not specified

239. BulTreeBank Morphosyntactic Corpus

Type:: corpus
Language:: Bulgarian
Description:: Written, synchronic, general, manually annotated, 1 000 000 tokens divided in three sets: 215 000 tokens used in BulTreeBank HPSG Treebank (see below), additionally 300 000 checked second time, rest about 480 000 checked by the annotators. Morphosyntactic annotation with the BulTreeBank Tagset (http://www.bultreebank.org/TechRep/BTB-TR03.pdf), XML, annotation description in technical reports of BulTreeBank project http://www.bultreebank.org/TechRep
Rights:: Not specified

240. BulTreeBank Morphosyntactic Disambiguator

Creator:: Simov, Kiril, Osenova, Petya, and Simov, Alex
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: This is a hybrid system: rules, neural network, rules. First rules for the sure cases are applied, then a neural network disambiguator is applied, then rules for repairing of the most frequent errors of the neural network. The rules are implemented as constraints in CLaRK System. The neural network is additional module implemented in Java. It is called CLaRK. It requires the morphologically annotated input.
Rights:: Not specified

241. BulTreeBank POS Corpus

Type:: corpus
Language:: Bulgarian
Description:: Written, synchronic, general, manually annotated; 50 000 tokens, 2600 sentences extracted from the BulTreeBank Text Archive in order to contain the most frequent ambiguity classes in Bulgarian
Rights:: Not specified

242. BulTreeBank Stopword List

Type:: lexicalConceptualResource
Language:: Bulgarian
Description:: 805 prepositions, pronouns, etc stop words, UTF-16 list of wordforms
Rights:: Not specified

243. BulTreeBank Text Archive

Type:: corpus
Language:: Bulgarian
Description:: 72 000 000 tokens, 15% fiction, 78% newspapers and 7% legal texts, government bulletins and others
Rights:: Not specified

244. BulTreeBank Tokenizer

Creator:: Simov, Kiril
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: The tokenizer is covering all languages that use Latin1, Laitn2, Latin3 and Cyrillic tables of Unicode. Can be extended to cover other tables in Unicode if necessary. The implementation is as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories. It is easy to be adapted to new token categories.
Rights:: Not specified

245. BUSCANEO

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Tool for neologism extraction.
Rights:: Not specified

246. BushBank

Creator:: Grác, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: interannotator agreement, corpus, chunks, phrases, and clauses
Language:: Czech
Description:: Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

247. Bústia Neològica Escolar

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Terminology management
Rights:: Not specified

248. Bwananet

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan, English, and Spanish
Description:: Tool for querying the Technical Corpus of the Institut Universitari de Lingüística Aplicada.
Rights:: Not specified

249. C4Corpus (CC BY-NC part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

250. C4Corpus (CC BY-NC-ND part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

251. C4Corpus (CC BY-NC-SA part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

252. C4Corpus (CC BY-ND part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

253. C4Corpus (CC BY-SA part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

254. C4Corpus (CC-BY part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

255. C4Corpus (publicdomain part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Russian, Slovenian, Somali, Spanish, Swahili (macrolanguage), Swedish, Tagalog, Thai, Turkish, Ukrainian, Undetermined, and Vietnamese
Description:: A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:: Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB

256. calcular_p_cue_class

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Statistical analysis service: It calculates P(cue|class): probability of seeing a linguistic cue given a lexical class. This probability is computed given the occurrences of cues in a corpus (codified in the signatures file) and the information of belonging or not belonging of these words to different classes (codified in indicators file). The probability is computed for each studied cue in the signatures file and for each class in the indicators file.
Rights:: Not specified

257. CALEM (Comprehensive Arabic LEMmas)

Creator:: Namly, Driss, Bouzoubaa, Karim, and El Jihad, Abdelhamid
Publisher:: ALELM
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon, lemmatization, and stemming;
Language:: Arabic
Description:: Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems) with details (POS + Root). Each lexical entry represents a lemma followed by all its possible stems and each stem is enriched by its morphological features especially the root and the POS. It is composed of 164,845 lemmas representing 7,200,918 stems, detailed as follow: 757 Arabic particles 2,464,631 verbal stems 4,735,587 nominal stems The lexicon is provided as an LMF conformant XML-based file in UTF8 encoding, which represents about 1,22 Gb of data. Citation: – Namly Driss, Karim Bouzoubaa, Abdelhamid El Jihad, and Si Lhoussain Aouragh. “Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique.” In Recent Advances in NLP: The Case of Arabic Language, pp. 81-100. Springer, Cham, 2020.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

258. canoonet – Deutsche Wörterbücher und Grammatik

Type:: lexicalConceptualResource
Subject:: Germanistik
Language:: German
Description:: Angabe von orthographischen, morphologischen (Wortformenbildung und Wortbildung) sowie semantischen Informationen (Synonymie; Hyperonymie/Hyponymie); Zuordnung der Wörter zu der jeweiligen syntaktischen Kategorie (bei Substantiven zusätzlich Angabe des Genus)
Rights:: Not specified

259. Čarek Jan (poet)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Čarek Jan (1898-1966)
Language:: No linguistic content
Description:: Poet Jan Čarek on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

261. Carl-Maria-von-Weber-Gesamtausgabe (WeGA)

Publisher:: Musikwissenschaftliches Seminar Detmold/Paderborn and Staatsbibliothek zu Berlin (Musikabteilung)
Type:: corpus
Subject:: Germanistik
Language:: German
Description:: Weber's diary entries, letters, writings, and opera; Tagebücher, Briefe, Schriften und Werke Webers
Rights:: Not specified

262. Cashinahua corpus

Publisher:: Max Planck Institute for Evolutionary Anthropology and Université Paris X Nanterre
Type:: corpus
Description:: Documentation of the Cashinahua project (DoBeS project)
Rights:: Code of conduct

263. CAST corpus (Computer-Aided Summarisation Tool)

Publisher:: Research Group in Computational Linguistics, University of Wolverhampton
Type:: corpus
Language:: English
Description:: Sentences annotated for important units of text for summarisation. 145,473 words / 6584 sentences
Rights:: Not specified

264. Catalan Annotated Corpora CQP

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: This RESTful service allows to define a sub-corpus from different annotated corpora. The service includes a POS tag harmonisation process where original tags are converted to EAGLES/Parole format. The eventual sub-corpus is indexed using the IMS CWB tool. The user receives an ID which can be used by the CQP service to exploit the sub-corpus.
Rights:: Not specified

265. Catalan Digital Press

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: This RESTful service accesses part of the Hemeroteca Digital de l’Arxiu Municipal de Girona (digital press archive from the Girona city council), specifically Catalan press from 2003. The service uses the SRU protocol.
Rights:: Not specified

266. catdoc

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Format conversion service: Word .doc to .txt converter
Rights:: Not specified

267. CEHugeWebCorpus

Creator:: Rüdiger, Jan Oliver
Publisher:: Rüdiger, Jan Oliver
Type:: text and corpus
Subject:: corpus, German, Germanistik, Web corpus, web corpora, and CorpusExplorer
Language:: German
Description:: This corpus was originally created for performance testing (server infrastructure CorpusExplorer - see: diskurslinguistik.net / diskursmonitor.de). It includes the filtered database (German texts only) of CommonCrawl (as of March 2018). First, the URLs were filtered according to their top-level domain (de, at, ch). Then the texts were classified using NTextCat and only uniquely German texts were included in the corpus. The texts were then annotated using TreeTagger (token, lemma, part-of-speech). 2.58 million documents - 232.87 million sentences - 3.021 billion tokens. You can use CorpusExplorer (http://hdl.handle.net/11234/1-2634) to convert this data into various other corpus formats (XML, JSON, Weblicht, TXM and many more).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/

268. Celebration in front of the Municipal House on 28 October 1932

Creator:: (:unav) Unknown author
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: 28. říjen 1932, shromáždění slavnostní, vůz přenosový Philips, mikrofon, projev veřejný, legionář, reproduktory, vojáci českoslovenští, výročí vznik ČSR 14., Vznik ČSR, Places::Praha::Staré Město::náměstí Republiky::Obecní dům, and People::Soukup František (1871-1940)
Language:: No linguistic content
Description:: Report from the celebration of the fourteenth anniversary of the Czechoslovak Republic held in front of the Municipal House in Prague on 28 October 1932. The gathering was attended by troops and legionnaires. A Philips Radio broadcast vehicle stands in front of the entrance. The segment includes a silent recording of a speech given by the Former Secretary of the National Committee and current Chairman of the Senate František Soukup.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

269. Celebration of 28 October 1933 in Mladá Boleslav

Creator:: Moderna Mladá Boleslav
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Vznik ČSR and Mladoboleslavský journál (4)
Language:: No linguistic content
Description:: Segment from the celebration of the fifteenth anniversary of the Czechoslovak Republic held on Old Town Square in Mladá Boleslav. A ceremonial line-up of the local military garrison. General Šípek delivers a speech before the municipal council, soldiers and inhabitants of the town. This is followed by a parade through the town, attended by the troops, representatives of the Sokol community, the local fire brigade, the gendarmerie, nurses and members of other town associations. Footage from a footrace through the streets of Mladá Boleslav to the Monument to the Fallen, won by Mr Mlejnek. Footage from the celebration in Kosmonosy, the place linked with the ground-breaking ceremony for the Resistance Memorial. The ceremony was attended by representatives of local associations and corporations as well as the county association of Czechoslovak legionnaires.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

270. Celebration on 28 October 1930

Creator:: Deglové
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: přehlídka vojenská, 28. říjen 1930, nádvoří hradní, vojáci českoslovenští, 28. pluk Pražské děti, Vznik ČSR, Places::Praha::Hradčany::Pražský hrad::třetí hradní nádvoří, People::Udržal František (1866-1938), and People::Viškovský Karel (1868-1932)
Language:: No linguistic content
Description:: The segment captures a military parade of new army recruits, held in the third courtyard of Prague Castle on 28 October 1930 as part of the celebration of the twelfth anniversary of the Czechoslovak Republic. Prime Minister František Udržal and Minister of Defence Karel Viškovský attend the parade, standing in for the absent President Masaryk.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

271. Celebration on 28 October 1931 in Bratislava

Creator:: (:unav) Unknown author
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: výročí vznik ČSR 13., projevy veřejné, 39. pěší pluk, stuhy připínané na prapor, Sokolové, balda, kapelník s baldou, standarta vojenská, nádraží, znak zemský Čechy, znak zemský Slovensko, přehlídka vojenská, automobily vojenské na přehlídce, lidé v krojích mávající, 54. dělostřelecký pluk, oslavy výročí 28. října, Sokolové v průvodu, hasiči v průvodu, Vznik ČSR, Places::Slovensko::Bratislava, People::Országh Jozef (1883-1949), People::Škvor František (1868-1941), and People::Krno Vladimír (1874-1955)
Language:: No linguistic content
Description:: The segment captures a military parade of new army recruits, held in Bratislava on 28 October 1931 as part of the celebration of the thirteenth anniversary of the Czechoslovak Republic. The celebration opens with speeches by General František Škvor and Vladimír Krno, the Mayor of Bratislava. The city of Bratislava honours the 39th Infantry Regiment with the Czechoslovak War Cross and the honorary title "General Graziani´s Infantry Regiment of Reconnaissance". The 54th Artillery Regiment adopts the standard of the 109th and 153rd Artillery Regiments, made by the residents of the Jedlička Institute in Prague. The cavalry and infantry together with the military band take charge of the following parade.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

272. Celebration on 28 October 1932 in Prague

Creator:: Elektajournal
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: 28. říjen 1932, vlajka československá vztyčování, vůz přenosový Philips, výročí vznik ČSR 14., Vznik ČSR, Places::Praha::Staré Město::náměstí Republiky::Obecní dům, and Places::Praha::Nové Město::náměstí Republiky::kasárna
Language:: No linguistic content
Description:: Report from the celebration of the fourteenth anniversary of the Czechoslovak Republic held in Prague on 28 October 1932. An aerial view of a parade in front of the Old Town City Hall. Gathering in front of the Municipal House on Republic Square.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

273. CELEX (web version)

Publisher:: Max Planck Institute for Psycholinguistics
Type:: lexicalConceptualResource
Language:: Dutch, English, and German
Rights:: Not specified

275. Čeněk Zahradníček (cinematographer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: kamera filmová, Galerie osobností, and People::Zahradníček Čeněk (1900-1989)
Language:: No linguistic content
Description:: Footage of cinematographer Čeněk Zahradníček at work with his camera.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

276. Cercador NEOROM

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Search engine for the neologisms database of the NEOROM network. The network collects neologisms used in the press written in Romance languages from 2005 onwards.
Rights:: Not specified

277. Cercador OBNEO

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Search engine of the BOBNEO data bank, a database of neologisms present in the mass media in Spanish and Catalan, written and oral, from 1992.
Rights:: Not specified

278. CERED baseline models

Creator:: Šimečková, Zuzana and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: mlmodel, text, and languageDescription
Subject:: relationship extraction
Language:: Czech
Description:: Relationship extraction models for the Czech language. Models are trained on CERED (dataset created by distant supervision on Czech Wikipedia and Wikidata) and recognize a subset of Wikidata relations (listed in CEREDx.LABELS). We supply a demo.py that performs inference on user-defined input and requirements.txt file for pip. Adapt the demo code to use the model. Both the dataset and the models are presented in Relationship Extraction thesis.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/

279. Ceremonial unveiling of the Ernst Denis Memorial

Creator:: Deglové
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: pomník Denis Ernest maketa, pomník Denis Ernest odhalení, stavba podstavce pomníku, sochař při práci, Sokolové na shromáždění slavnostním, shromáždění slavnostní, ateliér sochařský, generál československý, generál francouzský, projev veřejný, pomník Dvořák Karel, výročí vznik ČSR 10., Vznik ČSR, Places::Praha::Malá Strana::Malostranské náměstí::pomník Ernsta Denise, Places::Praha::Janáčkovo nábřeží::ateliér Karla Dvořáka /int./, People::Hodža Milan (1878-1944), People::Baxa Karel (1862-1938), People::Syrový Jan (1888-1970), People::Beneš Edvard (1884-1948), and People::Masaryk Tomáš Garrigue (1850-1937)
Language:: No linguistic content
Description:: The segment captures events preceding the installation and subsequent unveiling of the memorial statue of French historian and Slavonic scholar Ernest Denis on Lesser Town Square in Prague. Members of the Commission for the Construction of Denis´s Memorial use a maquette to find the best place for the statue. The event is witnessed by the artist, the sculptor Karel Dvořák. A shot of Dvořák in his studio in the courtyard of a house on Janáček Embankment in Smíchov, Prague. Digging works for the pedestal followed by the unveiling of the memorial on 27 October 1928, the eve of the tenth anniversary of the Czechoslovak Republic. President Tomáš Garrigue Masaryk, Minister of Education Milan Hodža, Prague Mayor Karel Baxa, General Jan Syrový, MP Antonín Uhlíř, French General Eugene Mittelhauser, French politician Alfred Oberkirch and others are present on the grandstand. Speech by Minister of Foreign Affairs Edvard Beneš. An image of the President T. G. Masaryk and Edvard Beneš.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

280. Česílko

Creator:: Hajič, Jan, Kuboň, Vladislav, and Homola, Petr
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService
Subject:: machine translation and Czech-Slovak translation
Language:: Czech
Description:: Česílko is a tool enabling the fast and efficient translation from one source language into many target languages, which are mutually related.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

281. Česílko 2.0 Shallow Transfer RBMT framework (opensource version)

Creator:: Vičič, Jernej, Kuboň, Vladislav, and Homola, Petr
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: Shallow Parse, Shallow Transfer Rule-Based Machine Translation, stochastic ranker, related languages, and toolbox
Description:: The system Česílko (language data and software tools) was first developed as an answer to a growing need of translation and localisation from one source language to many target languages. The starting system belonged to the Shallow Parse, Shallow Transfer Rule-Based Machine Translation – (RBMT) paradigm and it was designed primarily for translation of related languages. The latest implementation of the system uses a stochastic ranker; so technically it belongs to the hybrid machine translation paradigm, using stochastic methods combined with the traditional Shallow Transfer RBMT methods. The system has been stripped of the accompanying language resources due to copyright restrictions. The data that is available is just for demonstrative purposes.
Rights:: Not specified

282. Cesilko Web Service for Weblicht

Creator:: Hajič, Jan, Kuboň, Vladislav, and Homola, Petr
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and service
Subject:: machine translation
Description:: Weblicht integration of Cesilko (http://hdl.handle.net/11858/00-097C-0000-0006-AAFE-A)
Rights:: Not specified

283. Čestmír Loukotka (philologist)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, Places::Praha::Nové Město::Školská::pavlač Bohumila Veselého, and People::Loukotka Čestmír (1895-1966)
Language:: No linguistic content
Description:: Philologist Čestmír Loukotka on Bohumil Veselý's balcony.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

284. Chaco languages corpus

Type:: corpus
Description:: Documentation of the Chaco project (DoBeS project)
Rights:: Code of conduct

285. Chared

Creator:: Pomikálek, Jan
Publisher:: Masaryk University, NLP Centre
Type:: toolService and tool
Subject:: character encoding, character encoding detection, charset, and unicode
Language:: English
Description:: Chared is a software tool which can detect character encoding of a text document provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57 --- covering all major languages). Furthermore, it provides a training script to learn models for additional languages using a set of user supplied sample html pages in the given language. The detection algorithm is based on determining similarity of byte trigrams vectors. In general, chared should be more accurate than other character encoding detection tools with no language constraints. This is an important advantage allowing precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages consisting of 70 billions tokens altogether. Chared is an open source software, licensed under New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this piece of software was published in POMIKÁLEK, Jan a Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5. vyd. Brno, Czech Republic: Tribun EU, 2011. od s. 125-129, 5 s. ISBN 978-80-263-0077-9. and PRESEMT, Lexical Computing Ltd
Rights:: BSD 3-Clause "New" or "Revised" license, http://opensource.org/licenses/BSD-3-Clause, and PUB

286. Children Playing in Gas Mask in Prague

Creator:: Aktualita
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: děti v plynových maskách, masky plynové, koloběžka, tříkolka, hry dětské v maskách, Mnichovská dohoda, and Český zvukový týdeník Aktualita::1938/33
Language:: Czech
Description:: The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 33 shows children playing in gas masks in Prague.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

287. Chintang / Puma corpus

Publisher:: University of Leipzig
Type:: corpus
Description:: Documentation of the Chintang / Puma project (DoBeS project)
Rights:: Code of conduct

288. Chipaya

Type:: corpus
Description:: Documentation of the Chipaya project (DoBeS project)
Rights:: Code of conduct

289. Chontal corpus

Type:: corpus
Description:: Documentation of the Chontal project (DoBeS project)
Rights:: Code of conduct

290. Christmas exhibition in the hall of the Black Rose Palace

Creator:: Aktualita
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: stromek vánoční, hvězda vánoční, znak Kuratorium pro výchovu mládeže, akce Kuratorium pro výchovu mládeže, Kuratorium pro výchovu mládeže akce, dárky vánoční pro děti, dívky v krojích, kroje lidové, hračky rozdávání, Kuratorium, Places::Praha::Nové Město:: Na Příkopě::palác Černá růže /int./, People::Moravec Emanuel (1893-1945), People::Teuner František (1911-1978), and Český zvukový týdeník Aktualita::1944
Language:: Czech
Description:: Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 1A from 1944 was shot during a Christmas exhibition organised by the Board of Trustees for the Education of Youth and held in the hall of the Black Rose Palace in Na Příkopě Street in Prague from 18 to 22 December. The exhibition included a display of the 500 prettiest toys made as part of the Sewing Dolls initiative. Girls made 59,000 dolls, out of which 44,000 went to the children of the labourers working in the Reich and 15,000 to the children of the German soldiers fighting on the front. The exhibition was toured by Minister of Education and People´s Enlightenment and Chairman of the Board Emanuel Moravec and General Secretary of the Board František Teuner.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

291. CLaRK System - an XML-based system for Corpora Development

Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Subject:: corpus development
Description:: The CLaRK System incorporates several technologies: - XML technology - Unicode - Cascaded Regular Grammars; - Constraints over XML Documents On the basis of these technologies the following tools are implemented: XML Editor, Unicode Tokeniser, Sorting tool, Removing and Extracting tool, Concordancer, XSLT tool, Cascaded Regular Grammar tool, etc. 1 Unicode tokenization In order to provide possibility for imposing constraints over the textual node and to segment them in meaningful way, the CLaRK System supports a user-defined hierarchy of tokenisers. At the very basic level the user can define a tokeniser in terms of a set of token types. In this basic tokeniser each token type is defined by a set of UNICODE symbols. Above this basic level tokenisers, the user can define other tokenisers, for which the token types are defined as regular expressions over the tokens of some other tokeniser, the so called parent tokeniser. 2 Regular Grammars The regular grammars are the basic mechanism for linguistic processing of the content of an XML document within the system. The regular grammar processor applies a set of rules over the content of some elements in the document and incorporates the categories of the rules back in the document as XML mark-up. The content is processed before the application of the grammar rules in the following way: textual nodes are tokenized with respect to some appropriate tokeniser, the element nodes are textualized on the basis of XPath expressions that determine the important information about the element. The recognized word is substituted by a new XML mark-up, which can or can not contain the word. 3 Constraints The constraints that we implemented in the CLaRK System are generally based on the XPath language. We use XPath expressions to determine some data within one or several XML documents and thus we evaluate some predicates over the data. There are two modes of using a constraint. In the first mode the constraint is used for validity check, similar to the validity check, which is based on DTD or XML schema. In the second mode, the constraint is used to support the change of the document in order it to satisfy the constraint. There are three types of constraints, implemented in the system: regular expression constraints, number restriction constraints, value restriction constraints. 4 Macro Language In the CLaRK System the tools support a mechanism for describing their settings. On the basis of these descriptions (called queries) a tool can be applied only by pointing to a certain description record. Each query contains the states of all settings and options which the corresponding tool has. Once having this kind of queries there is a special tool for combining and applying them in groups (macros). During application the queries are executed successively and the result from an application is an input for the next one. For a better control on the process of applying several queries in one we introduce several conditional operators. These operators can determine the next query for application depending on certain conditions. When a condition for such an operator is satisfied, the execution continues from a location defined in the operator. The mechanism for addressing queries is based on user defined labels. When a condition is not satisfied the operator is ignored and the process continues from the position following the operator. In this way constructions like IF-THEN-ELSE and WHILE-DO easily can be expressed. The system supports five types of control operators: IF (XPath): the condition is an XPath expression which is evaluated on the current working document. If the result is a non-empty node-set, non-empty string, positive number or true boolean value the condition is satisfied; IF NOT (XPath): the same kind of condition as the previous one but the approving result is negated; IF CHANGED: the condition is satisfied if the preceding operation has changed the current working document or has produced a non-empty result document (depending on the operation); IF NOT CHANGED: the condition is satisfied if either the previous operation did not change the working document or did not produce a non-empty result. GOTO: unconditional changing the execution position. Each macro defined in the system can have its own query and can be incorporated in another macro. In this way some limited form of subroutine can be implemented. The new version of CLaRK will support server applications, calls to/from external programs.
Rights:: Not specified

292. CLaRK System - XML-based system for Corpora Development

Creator:: Simov, Kiril, Simov, Alex, and Kouylekov, Milen
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: The CLaRK System incorporates several technologies: - XML technology - Unicode - Cascaded Regular Grammars; - Constraints over XML Documents On the basis of these technologies the following tools are implemented: XML Editor, Unicode Tokeniser, Sorting tool, Removing and Extracting tool, Concordancer, XSLT tool, Cascaded Regular Grammar tool, etc. 1 Unicode tokenization In order to provide possibility for imposing constraints over the textual node and to segment them in meaningful way, the CLaRK System supports a user-defined hierarchy of tokenisers. At the very basic level the user can define a tokeniser in terms of a set of token types. In this basic tokeniser each token type is defined by a set of UNICODE symbols. Above this basic level tokenisers, the user can define other tokenisers, for which the token types are defined as regular expressions over the tokens of some other tokeniser, the so called parent tokeniser. 2 Regular Grammars The regular grammars are the basic mechanism for linguistic processing of the content of an XML document within the system. The regular grammar processor applies a set of rules over the content of some elements in the document and incorporates the categories of the rules back in the document as XML mark-up. The content is processed before the application of the grammar rules in the following way: textual nodes are tokenized with respect to some appropriate tokeniser, the element nodes are textualized on the basis of XPath expressions that determine the important information about the element. The recognized word is substituted by a new XML mark-up, which can or can not contain the word. 3 Constraints The constraints that we implemented in the CLaRK System are generally based on the XPath language. We use XPath expressions to determine some data within one or several XML documents and thus we evaluate some predicates over the data. There are two modes of using a constraint. In the first mode the constraint is used for validity check, similar to the validity check, which is based on DTD or XML schema. In the second mode, the constraint is used to support the change of the document in order it to satisfy the constraint. There are three types of constraints, implemented in the system: regular expression constraints, number restriction constraints, value restriction constraints. 4 Macro Language In the CLaRK System the tools support a mechanism for describing their settings. On the basis of these descriptions (called queries) a tool can be applied only by pointing to a certain description record. Each query contains the states of all settings and options which the corresponding tool has. Once having this kind of queries there is a special tool for combining and applying them in groups (macros). During application the queries are executed successively and the result from an application is an input for the next one. For a better control on the process of applying several queries in one we introduce several conditional operators. These operators can determine the next query for application depending on certain conditions. When a condition for such an operator is satisfied, the execution continues from a location defined in the operator. The mechanism for addressing queries is based on user defined labels. When a condition is not satisfied the operator is ignored and the process continues from the position following the operator. In this way constructions like IF-THEN-ELSE and WHILE-DO easily can be expressed. The system supports five types of control operators: IF (XPath): the condition is an XPath expression which is evaluated on the current working document. If the result is a non-empty node-set, non-empty string, positive number or true boolean value the condition is satisfied; IF NOT (XPath): the same kind of condition as the previous one but the approving result is negated; IF CHANGED: the condition is satisfied if the preceding operation has changed the current working document or has produced a non-empty result document (depending on the operation); IF NOT CHANGED: the condition is satisfied if either the previous operation did not change the working document or did not produce a non-empty result. GOTO: unconditional changing the execution position. Each macro defined in the system can have its own query and can be incorporated in another macro. In this way some limited form of subroutine can be implemented. The new version of CLaRK will support server applications, calls to/from external programs.
Rights:: Not specified

293. CLEF-TREC Q/A

Creator:: Abouenour, lahcen, Bouzoubaa, Karim, and Rosso, paolo
Publisher:: ALELM
Type:: text, other, and lexicalConceptualResource
Subject:: CLEF and TREC
Language:: Arabic
Description:: List of 2264 questions + answers of CLEF and TREC, translated to Arabic
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

294. CLIPS : corpora e lessici di italiano parlato e scritto

Publisher:: Università degli studi di Napoli Federico II
Type:: corpus
Language:: Italian
Description:: Audio files of about 100 hours of speech from 15 different cities in Italy. Various recordings are transcribed to read in PDF
Rights:: Not specified

295. CoCzeFLA Chroma 2022.07

Creator:: Chromá, Anna and Matiasovitsová, Klára
Publisher:: Faculty of Arts, Charles Univesity
Type:: text and corpus
Subject:: first language acquisition, typical development, and monolingual corpus
Language:: Czech
Description:: Transcripts of longitudinal audio recordings of 7 Czech typical monolingual children between 1;7 to 3;9. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the presudonym of the child and her age at the given session in form YMMDD. Transcription rules and other details are to find on the homepage coczefla.ff.cuni.cz.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

296. CoCzeFLA Chroma 2023.04

Creator:: Chromá, Anna, Matiasovitsová, Klára, Sláma, Jakub, and Treichelová, Jolana
Publisher:: Charles University, Faculty of Arts
Type:: text and corpus
Subject:: first language acquisition, typical development, longitudinal corpus, and Czech
Language:: Czech
Description:: A new version of the previously published corpus Chroma. The version 2023.04 includes six children. Two transcripts (Julie20221, Klara30424) were removed since they did not meet the criteria on the dialogical format. The transcripts were revised (eliminating typing errors and inconsistencies in the transcription format) and morphologically annotated by the automatic tool MorphoDiTa. Detailed manual control of the annotation was performed on children's utterances; the annotation of adult data was not checked yet. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

297. CoCzeFLA Chroma 2023.07

Creator:: Chromá, Anna, Sláma, Jakub, Matiasovitsová, Klára, and Kohoutková, Jolana
Publisher:: Charles University, Faculty of Arts
Type:: text and corpus
Subject:: first language acquisition, typical development, longitudinal corpus, and Czech
Language:: Czech
Description:: A new version of the previously published corpus Chroma wih morphological annotation. The version 2023.07 differs from 2023.04 in that it includes all seven children and it went through an additional careful check of consistency and conformity to the CHAT transcription principles. Two transcripts (Julie20221, Klara30424) from the previous versions (2022.07, 2019.07) were removed since they did not meet our criteria on dialogical format. All transcripts of recordings made during one day were split into one file. Thus, version 2023.07 consists of 183 files/transcripts. The number of utterances and tokens given here in LINDAT corresponds to children's lines only. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

298. Code-switching conversation corpus

Publisher:: Max Planck Institute for Psycholinguistics
Type:: corpus
Language:: Dutch
Description:: The code-switching corpus consists of 5x30-minute conversations between four speakers (i.e. a total of 20 speakers). The speakers are bilingual speakers of Papiamento (a creole langauge spoken in the Dutch Antilles) and Dutch. In the course of their free conversations, they engage in code-switching, that is, they use both languages within the same utterance in systematic ways. The corpus is fully transcribed and glossed, coded for language and word class, in ELAN.
Rights:: Not specified

299. COLDIC

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Tool for dictionary management
Rights:: Not specified

300. Collection of Latvian literature

Publisher:: Tilde
Type:: corpus
Language:: Latvian
Description:: Masterpieces of Latvian literature from the beginning of Latvian literature until first decades of 20th century
Rights:: Not specified

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Coverage

Show values starting with

Creator

Show values starting with

Format

Language

Show values starting with

Publisher

Show values starting with

Rights

Show values starting with

Subject

Show values starting with

Type

Show values starting with

Date

Original context has metadata only

Harvested from