Rights: PUB - LINDAT/CLARIAH-CZ Catalog Search Results

231. HindMonoCorp 0.5

Creator:: Bojar, Ondřej, Diatka, Vojtěch, Rychlý, Pavel, Straňák, Pavel, Suchomel, Vít, Tamchyna, Aleš, and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus
Language:: Hindi
Description:: Hindi monolingual corpus. It is based primarily on web crawls performed using various tools and at various times. Since the web is a living data source, we treat these crawls as completely separate sources, despite they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments if we concatenate the individual sources (each source being deduplicated on its own) with the number of segments if we de-duplicate all sources to- gether. The difference is just around 1%, confirming, that various web crawls (or their subsequent processings) differ significantly. HindMonoCorp contains data from: Hindi web texts, a monolingual corpus containing mainly Hindi news articles has already been collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010 and we add a small crawl performed in 2013 and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following. Hindi corpora in W2C have been collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two corpora of Hindi available—one from web harvest (W2C Web) and one from the Wikipedia (W2C Wiki). SpiderLing is a web crawl carried out during November and December 2013 using SpiderLing (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents, see below. CommonCrawl is a non-profit organization that regu- larly crawls the web and provides anyone with the data. We are grateful to Christian Buck for extracting plain text Hindi segments from the 2012 and 2013-fall crawls for us. Intercorp – 7 books with their translations scanned and manually alligned per paragraph RSS Feeds from Webdunia.com and the Hindi version of BBC International followed by our custom crawler from September 2013 till January 2014. and LM2010013,
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

232. HMM tagger

Creator:: Krbec, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService
Subject:: tagger and morphology
Language:: Czech
Description:: The HMM-based Tagger is a software for morphological disambiguation (tagging) of Czech texts. The algorithm is statistical, based on the Hidden Markov Models.
Rights:: GNU General Public License, version 2, http://www.gnu.org/licenses/gpl-2.0.html, and PUB

233. huntoken - tokenizer and sentence splitter

Creator:: Németh, László, Halácsy, Péter, and Kornai, András
Publisher:: Budapest Technical University Media Research Centre
Type:: toolService
Subject:: tokenizer
Description:: HunToken is a rule based tokenizer and sentence boundary detector for Hungarian (and English) texts.
Rights:: GNU Library or "Lesser" General Public License 3.0 (LGPL-3.0), http://opensource.org/licenses/LGPL-3.0, and PUB

234. HWC2023 –Hamburg.de Website Corpus 2023

Creator:: Rüdiger, Jan Oliver
Publisher:: Leibniz-Institut für Deutsche Sprache
Type:: text and corpus
Subject:: corpus, Web corpus, web corpora, Germanistik, German, websites, crawling corpus, and CorpusExplorer
Language:: German
Description:: A petition for a referendum (called: "Schluss mit Gendersprache in Verwaltung und Bildung" / eng.: "abolition of gender language in administration and education") was formed in Hamburg in February 2023. The project "Empirical Gender Linguistics" at the "Leibniz Institute for the German Language" took this as an opportunity to completely scrap the "https://www.hamburg.de" website (except the list of ships in the Port of Hamburg and the yellow page). The Hamburg.de website is the central digital contact point for citizens. The scraped texts were cleaned, processed and annotated using http://www.CorpusExplorer.de (TreeTagger - POS/Lemma information). We use the corpus to analyze the use of words with gender signs.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/3.0/

235. IDENTICv1.0

Creator:: Larasati, Septina Dian
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: Indonesian-English parallel corpus and parallel corpus
Language:: Indonesian and English
Description:: IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus is a bilingual corpus paired with English. The aim of this work is to build and provide researchers a proper Indonesian-English textual data set and also to promote research in this language pair. The corpus contains texts coming from different sources with different genres. and The research leading to these results has received funding from the European Commission’s 7th Framework Program under grant agreement no 238405 (CLARA) and by the grant LC536 Centrum Komputacni Lingvistiky of the Czech Ministry of Education.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

236. IDENTICv1.0-raw

Creator:: Larasati, Septina Dian
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: Indonesian-English parallel corpus and parallel corpus
Language:: Indonesian and English
Description:: Raw Text
Rights:: Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB

237. Image Annotation Tool

Creator:: Roček, Martin
Publisher:: Charles University, Faculty of Arts
Type:: service and toolService
Subject:: tei and javascript
Language:: English
Description:: Image annotation tool is a web application that allows users to mark zones of interest in an image. These zones are then converted to TEI P5 code snippet that can be used in your document to connect the image and the text. This tool was developed to help students and teachers at the Faculty of Arts, Charles University to mark and annotate images of manuscripts.
Rights:: GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB

238. Imperative Benefit Evaluation

Creator:: Chromá, Anna and Perevozchikova, Tatiana
Publisher:: Charles University, Faculty of Arts and Eberhard-Karls-Universität Tübingen
Type:: text, other, and lexicalConceptualResource
Subject:: imperative, cost-benefit, illocutionary force, politeness, speech act, directives, speaker-oriented, and Czech
Language:: Czech
Description:: The contribution includes the data frame and the R script (Markdown file) belonging to the paper "Who Benefits from an Imperative? Assessment of Directives on a Benefit-Scale" submitted to the journal Pragmatics on September 2024.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

239. Implemented Spelling Rules

Creator:: Saty, Ahmed, Aouragh, Si Lhoussain, and Bouzoubaa, Karim
Publisher:: Sudan University of Science and Technology
Type:: text, other, and languageDescription
Subject:: Implemented Spelling Rules
Language:: Arabic and English
Description:: The book [1] contains spelling rules classified into ten categories, each category containing many rules. This XML file presents our implemented rules classified with six category tags, as is the case in the book. We implemented 24 rules since the remaining rules require diacritical and morphological analysis that are outside the scope of our present work. References: [1] Dr.Fahmy Al-Najjar, 'Spelling rules in ten easy lessons', Al Kawthar Library,2008. Available: https://www.alukah.net/library/0/53498/%D9%82%D9%88%D8%A7%D8%B9%D8%AF-%D8%A7%D9%84%D8%A5%D9%85%D9%84%D8%A7%D8%A1-%D9%81%D9%8A-%D8%B9%D8%B4%D8%B1%D8%A9-%D8%AF%D8%B1%D9%88%D8%B3-%D8%B3%D9%87%D9%84%D8%A9-pdf/
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

240. Information extraction from EIA documents

Creator:: Lukšová, Ivana and Hladká, Barbora
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: information extraction and rule-based extraction
Language:: Czech
Description:: Environmental impact assessment (EIA) is the formal process used to predict the environmental consequences of a plan. We present a rule-based extraction system to mine Czech EIA documents. The extraction rules work with a set of documents enriched with morphological information and manually created vocabularies of terms supposed to be extracted from the documents, e.g. basic information about the project (address, ID company, ...), data on the impacts and outcomes (waste substances, endangered species, ...), a final opinion. The documents Notice of Intent contains the section BI2 with the information on the scope (capacity) of the plan.
Rights:: GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB

231. HindMonoCorp 0.5

232. HMM tagger

233. huntoken - tokenizer and sentence splitter

234. HWC2023 –Hamburg.de Website Corpus 2023

235. IDENTICv1.0

236. IDENTICv1.0-raw

237. Image Annotation Tool

238. Imperative Benefit Evaluation

239. Implemented Spelling Rules

240. Information extraction from EIA documents

Limit your search

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Coverage

Creator

Show values starting with

Language

Show values starting with

Publisher

Show values starting with

Rights

Show values starting with

Subject

Show values starting with

Type

Show values starting with

Original context has metadata only

Harvested from