Data
----
Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images, suitable for the English-to-Hindi multimodal machine translation task and for multimodal research. We selected short English segments (captions) from Visual Genome along with the associated images and automatically translated them to Hindi, with manual post-editing that took the associated images into account. The training set contains 29K segments. A further 1K and 1.6K segments are provided as development and test sets, respectively, which follow the same (random) sampling from the original Visual Genome.
Additionally, a challenge test set of 1,400 segments will be released for the WAT2019 multimodal task. This challenge test set was created by searching for particularly ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity.
Dataset Formats
--------------
The multimodal dataset contains both text and images.
The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hindi Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
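The seven-column format above can be read with a few lines of Python. This is a minimal sketch; the function name and dictionary keys are illustrative, not part of the release:

```python
def read_segments(path):
    """Parse one split of Hindi Visual Genome: a tab-delimited file with
    seven columns (image_id, X, Y, Width, Height, English text, Hindi text)."""
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            image_id, x, y, w, h, en, hi = line.rstrip("\n").split("\t")
            segments.append({
                "image_id": image_id,
                # rectangular region of the image described by the caption
                "region": (int(x), int(y), int(w), int(h)),
                "en": en,
                "hi": hi,
            })
    return segments
```

The image for each segment can then be located by `image_id` and cropped to `region` with any image library.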
Data Statistics
----------------
The statistics of the current release are given below.
Parallel Corpus Statistics
---------------------------
Dataset          Segments   English Words   Hindi Words
--------------   --------   -------------   -----------
Train               28932          143178        136722
Dev                   998            4922          4695
Test                 1595            7852          7535
Challenge Test       1400            8185          8665   (released separately)
--------------   --------   -------------   -----------
Total               32925          164137        157617
The word counts are approximate, prior to tokenization.
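Counts of this kind can be reproduced by simple whitespace splitting. A minimal sketch (the function name is illustrative, and whitespace splitting is only an approximation of the counting actually used):

```python
def word_counts(pairs):
    """Approximate word counts by whitespace splitting, prior to tokenization.
    pairs: iterable of (english_text, hindi_text) tuples."""
    en = sum(len(e.split()) for e, _ in pairs)
    hi = sum(len(h.split()) for _, h in pairs)
    return en, hi
```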
Citation
--------
If you use this corpus, please cite the following paper:
@article{hindi-visual-genome:2019,
title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
journal={Computaci{\'o}n y Sistemas},
note={In print. Presented at CICLing 2019, La Rochelle, France},
year={2019},
}
HindMonoCorp is a Hindi monolingual corpus. It is based primarily on web crawls performed using various tools at various times. Since the web is a living data source, we treat these crawls as completely separate sources, even though they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments when the individual sources are concatenated (each source deduplicated on its own) with the number of segments when all sources are deduplicated together. The difference is just around 1%, confirming that the various web crawls (or their subsequent processings) differ significantly.
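The overlap estimate described above can be sketched in a few lines, assuming segments are compared as exact strings (the function name is illustrative):

```python
def overlap_estimate(sources):
    """sources: list of segment lists, one per crawl.
    Compares the total after per-source deduplication with the total after
    global deduplication; returns the relative difference (the overlap share)."""
    concat_total = sum(len(set(s)) for s in sources)          # each source deduplicated on its own
    global_total = len(set().union(*map(set, sources)))       # all sources deduplicated together
    return (concat_total - global_total) / concat_total
```

For HindMonoCorp this quantity came out at around 1%.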
HindMonoCorp contains data from:
Hindi web texts, a monolingual corpus containing mainly Hindi news articles, was collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010, add a small crawl performed in 2013, and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following.
Hindi corpora in W2C were collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two Hindi corpora available: one from a web harvest (W2C Web) and one from Wikipedia (W2C Wiki).
SpiderLing denotes a web crawl carried out during November and December 2013 using the SpiderLing crawler (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain text and deduplication at the level of documents; see below.
CommonCrawl is a non-profit organization that regularly crawls the web and makes the data available to anyone. We are grateful to Christian Buck for extracting plain-text Hindi segments from the 2012 and fall 2013 crawls for us.
Intercorp – 7 books with their translations, scanned and manually aligned per paragraph.
RSS feeds from Webdunia.com and the Hindi version of BBC International, collected by our custom crawler from September 2013 till January 2014.
A petition for a referendum (called "Schluss mit Gendersprache in Verwaltung und Bildung", Eng.: "abolition of gender language in administration and education") was launched in Hamburg in February 2023. The project "Empirical Gender Linguistics" at the Leibniz Institute for the German Language took this as an opportunity to scrape the entire "https://www.hamburg.de" website (except the list of ships in the Port of Hamburg and the yellow pages). The Hamburg.de website is the central digital point of contact for citizens. The scraped texts were cleaned, processed, and annotated using http://www.CorpusExplorer.de (TreeTagger for POS/lemma information).
We use the corpus to analyze the use of words with gender signs.
The article deals with the development of analytic approaches to the use of specific tenses in the indicative mood in German subordinate content clauses, as presented in German linguistics. The author presents the results of her own research based on examples of subordinate content clauses found in the Mannheim corpus of German texts. According to the latest German scholarship, two principles govern the tense distribution in the indicative mood in subordinate content clauses introduced by a verb in the past tense: 1. the perspective of speaker 1, i.e. the point of view of the characters in the story; 2. the perspective of speaker 2, i.e. the point of view of the narrator. The first principle is comparable to the principle governing the use of tenses in subordinate content clauses in Slavic languages; the second is comparable to the sequence of tenses used in English and other Germanic languages. The first principle is used more in spoken or non-standard discourse, while the second is typical of standard German. The present paper focuses on sentences consisting of a past-tense main clause and one embedded content clause that allows alternation between present tense and preterite (...dass sie schwanger ist vs. ...dass sie schwanger war), as attested in the Mannheim corpus. The analysis essentially confirms the existing approaches and theories, but it also brings new findings, which call for adjusting the current views and pose new questions for more comprehensive corpus-based research.
This corpus consists of full transcriptions of both Democratic and Republican 2016 presidential candidate debates, with a special focus on the idiolects of Hillary Clinton and Donald Trump against the background of the speeches of other candidates for the post of president of the United States.
The transcriptions are sourced from the American Presidency Project at the University of California, Santa Barbara. Any use of the material requires prior and explicit written permission from the project administrator (contact policy@ucsb.edu). This corpus material is shared here with their kind permission.
Indonesian text corpus from the web. Crawling was done by SpiderLing in 2017; filtering by jusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized by MorphInd (http://septinalarasati.com/morphind/).
This paper works with data provided by the Czech National Corpus to consider the use of nepřizpůsobivý (inadaptable) by the Czech mainstream print media as a code word that is widely understood to signify a Roma citizen. The study shows that nepřizpůsobivý is used far more frequently in journalism than in other text genres and that its use has increased over the past decade. Examination of collocations reveals that nepřizpůsobivý is typically associated with negative reports on housing, residency, and crime. This paper can also be seen as a case study illustrating the usefulness of corpus data for critical discourse analysis and, more generally, the role of the corpus in providing quantitative support to qualitative research.
The general consensus in linguistics is that language context (or "co-text") plays a crucial role in describing the linguistic properties of language items. As a corollary, isolated units are inherently ambiguous (polysemous and/or polyfunctional). In this paper we describe the most influential forces leading to the disambiguation of language units, specifically the effect of n-gram length on ambiguity.
This editor was developed especially for the needs of the KAMOKO project (https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3261). The editor allows quick entry of example sentences and sentence variants, as well as the corresponding speaker ratings.