Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

871. Hindi Visual Genome 1.1

Creator:: Parida, Shantipriya and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: multilingual, neural machine translation, multi-modal, English-Hindi parallel corpus, image captioning, and image annotation
Language:: English and Hindi
Description:: Data
----
Hindi Visual Genome 1.1 is an updated version of Hindi Visual Genome 1.0. The update primarily concerns the text part of Hindi Visual Genome, fixing translation issues reported during the WAT 2019 multimodal task. In the image part, only one segment and thus one image were removed from the dataset. Hindi Visual Genome 1.1 serves in the "WAT 2020 Multi-Modal Machine Translation Task".

Hindi Visual Genome is a multimodal dataset consisting of text and images, suitable for the English-to-Hindi multimodal machine translation task and for multimodal research. We selected short English segments (captions) from Visual Genome along with the associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account.

The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.

A third test set, called the "challenge test set", consists of 1.4K segments and was released for the WAT 2019 multi-modal task. The challenge test set was created by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity. The surrounding words in the sentence, however, also often include sufficient cues to identify the correct meaning of the ambiguous word.

Dataset Formats
--------------
The multimodal dataset contains both text and images.

The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files. All the text files have seven columns, as follows:

Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hindi Text

The image part contains the full images, with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.

Data Statistics
----------------
The statistics of the current release are given below.

Parallel Corpus Statistics
---------------------------

Dataset          Segments   English Words   Hindi Words
--------------   --------   -------------   -----------
Train               28930          143164        145448
Dev                   998            4922          4978
Test                 1595            7853          7852
Challenge Test       1400            8186          8639
--------------   --------   -------------   -----------
Total               32923          164125        166917

The word counts are approximate, prior to tokenization.

Citation
--------
If you use this corpus, please cite the following paper:

@article{hindi-visual-genome:2019,
  title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
  author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
  journal={Computaci{\'o}n y Sistemas},
  volume={23},
  number={4},
  pages={1499--1505},
  year={2019}
}
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
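
The seven-column tab-delimited layout described for Hindi Visual Genome can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset: the names `Segment`, `read_segments` and `crop_box` are hypothetical, and it assumes that X, Y, Width and Height are integer pixel values with (X, Y) as the top-left corner of the region, which the description does not state explicitly.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Tuple


class Segment(NamedTuple):
    """One row of a text file: image region plus its parallel captions."""
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    hindi: str


def read_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Yield a Segment for every well-formed seven-column row."""
    # QUOTE_NONE: captions may contain quote characters that are plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 7:
            continue  # skip blank or malformed rows
        image_id, x, y, w, h, english, hindi = row
        # Assumption: coordinates are integers (pixels).
        yield Segment(image_id, int(x), int(y), int(w), int(h), english, hindi)


def crop_box(seg: Segment) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom), assuming (x, y) is the top-left corner."""
    return (seg.x, seg.y, seg.x + seg.width, seg.y + seg.height)


# Usage with an in-memory sample row (values invented for illustration).
sample = io.StringIO("123\t10\t20\t30\t40\ta red ball\tएक लाल गेंद\n")
segments = list(read_segments(sample))
```

For a real file, pass an open file handle instead of the `StringIO` object, for example `open(path, encoding="utf-8", newline="")`, where `path` points to one of the text files in the release.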

872. Hindi Web Texts

Creator:: Bojar, Ondřej, Straňák, Pavel, and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: news and web texts
Language:: Hindi
Description:: A Hindi corpus of texts downloaded mostly from news sites. It contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. Size: 18M sentences, 308M tokens. Project identifiers: FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+).
Rights:: Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB

873. HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Creator:: Bafna, Niyati, Žabokrtský, Zdeněk, España-Bonet, Cristina, van Genabith, Josef, Kumar, Lalit "Samyak Lalit", Suman, Sharda, and Shivay, Rahul
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and Kavita Kosh Project
Type:: text and corpus
Subject:: dialect continuum, dialect variation, Indic, Indo-Aryan, Indian, and Hindi
Language:: Hindi, Marathi, Magahi, Awadhi, Bhojpuri, Braj, Haryanvi, Rajasthani, Korku, Garhwali, Chhattisgarhi, Bhili, Sanskrit, Angika, Bundeli, Kumaoni, Bhadrawahi, Bengali, Gujarati, Panjabi, Nimadi, Kanauji, Malvi, and Uncoded languages
Description:: HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Languages
This is a collection of folksongs for 26 languages that form a dialect continuum in North India and nearby regions, namely Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit. The data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/ .

Here are the main characteristics of the languages in this collection:
- They are all Indic languages except for Korku.
- The majority of them are closely related genealogically to the standard Hindi dialect (such as Haryanvi and Bhojpuri), although the collection also contains languages such as Bengali and Gujarati, which are more distant relatives.
- They are all primarily spoken in (North) India (Bengali is also spoken in Bangladesh).
- All except Sanskrit are living languages.

Data
Categorising them by pre-existing available NLP resources, we have:
* Band 1 languages: Hindi, Panjabi, Gujarati, Bengali, Nepali. These languages already have other large standard datasets available. Kavita Kosh may have very little data for these languages.
* Band 2 languages: Bhojpuri, Magahi, Awadhi, Braj. These languages are attracting growing interest and have some datasets, of a relatively small size compared to Band 1 language resources.
* Band 3 languages: All other languages in the collection were previously zero-resource languages. These are the languages for which this dataset is the most relevant.

Script
This dataset is entirely in Devanagari. Content in languages not written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.

Format
The dataset contains one text file per language, holding that language's folksongs. Folksongs are separated from each other by an empty line. The first line of a new piece is the title of the folksong, and line separation within folksongs is preserved.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
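
The HinDialect 1.1 file format (folksongs separated by an empty line, first line of each piece being the title) lends itself to a simple block parser. The sketch below is only an illustration under stated assumptions: `Folksong` and `parse_folksongs` are hypothetical names, lines containing only whitespace are treated as empty, and the text is assumed to be already decoded (for example from UTF-8).

```python
from typing import List, NamedTuple


class Folksong(NamedTuple):
    title: str        # first line of the piece
    lines: List[str]  # remaining lines, in their original order


def parse_folksongs(text: str) -> List[Folksong]:
    """Split the content of one language file into folksongs."""
    songs: List[Folksong] = []
    block: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            block.append(line)
        elif block:
            # An empty line closes the current folksong.
            songs.append(Folksong(block[0], block[1:]))
            block = []
    if block:
        # The last folksong may not be followed by an empty line.
        songs.append(Folksong(block[0], block[1:]))
    return songs


# Usage with placeholder text instead of real Devanagari content.
demo = "Title A\nline 1\nline 2\n\nTitle B\nline 3\n"
songs = parse_folksongs(demo)
```

Keeping the title separate from the body makes it easy to exclude titles when only the song text is wanted, for instance when building a language identification training set.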

874. HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Creator:: Bafna, Niyati, Žabokrtský, Zdeněk, España-Bonet, Cristina, van Genabith, Josef, Kumar, Lalit "Samyak Lalit", Suman, Sharda, and Shivay, Rahul
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and Kavita Kosh Project
Type:: text and corpus
Subject:: dialect continuum, dialect variation, Indic, Indo-Aryan, Indian, and Hindi
Language:: Hindi, Marathi, Magahi, Awadhi, Bhojpuri, Braj, Haryanvi, Rajasthani, Korku, Garhwali, Chhattisgarhi, Bhili, Sanskrit, Angika, Bundeli, Kumaoni, Bhadrawahi, Bengali, Gujarati, Panjabi, Nimadi, Kanauji, Malvi, and Uncoded languages
Description:: HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Languages
This is a collection of folksongs for 26 languages that form a dialect continuum in North India and nearby regions, namely Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit. The data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/ .

Here are the main characteristics of the languages in this collection:
- They are all Indic languages except for Korku.
- The majority of them are closely related genealogically to the standard Hindi dialect (such as Haryanvi and Bhojpuri), although the collection also contains languages such as Bengali and Gujarati, which are more distant relatives.
- All except Nepali are primarily spoken in (North) India.
- All except Sanskrit are living languages.

Data
Categorising them by pre-existing available NLP resources, we have:
* Band 1 languages: Hindi, Marathi, Punjabi, Sindhi, Gujarati, Bengali, Nepali. These languages already have other large datasets available. Since Kavita Kosh focusses largely on Hindi-related languages, we may have very little data for these other languages in this particular dataset.
* Band 2 languages: Bhojpuri, Magahi, Awadhi, Brajbhasha. These languages are attracting growing interest and have some datasets, of a relatively small size compared to Band 1 language resources.
* Band 3 languages: All other languages in the collection were previously zero-resource languages. These are the languages for which this dataset is the most relevant.

Script
This dataset is entirely in Devanagari. Content in languages not written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.

Format
The data is organised by language, with each folksong stored in a separate JSON file.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

875. HindMonoCorp 0.5

Creator:: Bojar, Ondřej, Diatka, Vojtěch, Rychlý, Pavel, Straňák, Pavel, Suchomel, Vít, Tamchyna, Aleš, and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus
Language:: Hindi
Description:: Hindi monolingual corpus. It is based primarily on web crawls performed using various tools and at various times. Since the web is a living data source, we treat these crawls as completely separate sources, even though they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments obtained by concatenating the individual sources (each source being deduplicated on its own) with the number of segments obtained by deduplicating all sources together. The difference is only around 1%, confirming that the various web crawls (or their subsequent processing) differ significantly.

HindMonoCorp contains data from:
- Hindi web texts: a monolingual corpus containing mainly Hindi news articles, already collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010, add a small crawl performed in 2013, and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following.
- Hindi corpora in W2C: collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two corpora of Hindi available, one from a web harvest (W2C Web) and one from Wikipedia (W2C Wiki).
- SpiderLing: a web crawl carried out during November and December 2013 using SpiderLing (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents.
- CommonCrawl: a non-profit organization that regularly crawls the web and provides the data to anyone. We are grateful to Christian Buck for extracting plain text Hindi segments from the 2012 and 2013-fall crawls for us.
- Intercorp: 7 books with their translations, scanned and manually aligned per paragraph.
- RSS feeds from Webdunia.com and the Hindi version of BBC International, followed by our custom crawler from September 2013 until January 2014.

Project identifier: LM2010013.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
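
The overlap estimate described for HindMonoCorp (segments after per-source deduplication versus segments after joint deduplication) can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline: `overlap_ratio` is a hypothetical name, segments are compared by exact string equality, and expressing the difference relative to the concatenated total is an assumption, since the description does not say which denominator was used.

```python
from typing import Iterable, List


def overlap_ratio(sources: Iterable[Iterable[str]]) -> float:
    """Fraction of segments lost when sources are deduplicated jointly
    rather than each on its own."""
    # Deduplicate each source on its own.
    per_source: List[set] = [set(segments) for segments in sources]
    concatenated = sum(len(s) for s in per_source)
    # Deduplicate all sources together.
    joint = len(set().union(*per_source))
    if concatenated == 0:
        return 0.0
    # Assumption: the difference is reported relative to the concatenated size.
    return (concatenated - joint) / concatenated


# Toy usage: "b" occurs in both sources, so 1 of 4 segments is shared.
ratio = overlap_ratio([["a", "b", "b"], ["b", "c"]])
```

A value near 0.01 would correspond to the roughly 1% difference reported in the description.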

876. Historical Corpus of the Welsh Language 1500-1850

Publisher:: University of Cambridge
Format:: application/tei+xml
Type:: corpus
Language:: Welsh
Description:: Welsh texts from the period 1500-1850. Overall the corpus contains around 420,000 words from 30 texts.
Rights:: Not specified

877. HMM tagger

Creator:: Krbec, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService
Subject:: tagger and morphology
Language:: Czech
Description:: The HMM-based Tagger is software for morphological disambiguation (tagging) of Czech texts. The algorithm is statistical, based on Hidden Markov Models.
Rights:: GNU General Public License, version 2, http://www.gnu.org/licenses/gpl-2.0.html, and PUB
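
For readers unfamiliar with HMM-based tagging, the sketch below shows generic Viterbi decoding over a bigram Hidden Markov Model. It is not the implementation of the tool listed above, whose model order, tagset and smoothing are not described in this record; the function name `viterbi`, the two-tag tagset and all probabilities are invented purely for illustration.

```python
import math
from typing import Dict, List


def viterbi(words: List[str], tags: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []

    def lg(p: float) -> float:
        # Log-probabilities avoid underflow; zero probability maps to -inf.
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{t: (lg(start_p.get(t, 0.0)) + lg(emit_p[t].get(words[0], 0.0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = lg(emit_p[t].get(words[i], 0.0))
            column[t] = max(
                (best[i - 1][p][0] + lg(trans_p[p].get(t, 0.0)) + emit, p)
                for p in tags
            )
        best.append(column)

    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (all numbers are made up): N = noun, V = verb.
toy_tags = ["N", "V"]
toy_start = {"N": 0.8, "V": 0.2}
toy_trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
toy_emit = {"N": {"psi": 0.6, "stekaji": 0.1}, "V": {"psi": 0.1, "stekaji": 0.7}}
decoded = viterbi(["psi", "stekaji"], toy_tags, toy_start, toy_trans, toy_emit)
```

A practical tagger would additionally restrict each word to the tags proposed by a morphological analyser and smooth the probabilities for unseen events; both are omitted here to keep the sketch short.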

878. HNC (Hellenic National Corpus)

Publisher:: Institute for Language and Speech Processing
Format:: application/octet-stream
Type:: corpus
Language:: Modern Greek (1453-)
Description:: General language corpus of standard Modern Greek; 47 million words (47 MW).
Rights:: Not specified

879. Hocank corpus

Type:: corpus
Description:: Documentation of the Hocank project (DoBeS project)
Rights:: Code of conduct

880. html2text

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Description:: Format conversion service: .html to .txt converter
Rights:: Not specified
