Hausa Visual Genome 1.0
-----------------------
http://hdl.handle.net/11234/1-4749

Authors:
  Idris Abdulmumin, Satya Ranjan Dash, Musa Abdullahi Dawud,
  Shantipriya Parida, Shamsuddeen Muhammad, Ibrahim Sa’id Ahmad,
  Subhadarshi Panda, Ondřej Bojar, Bashir Shehu Galadanci, and
  Bello Shehu Bello


Data
----

Hausa Visual Genome 1.0 is a multimodal dataset consisting of text and images,
suitable for English-to-Hausa multimodal machine translation tasks and
multimodal research. We follow the same selection of short English segments
(captions) and associated images from Visual Genome as Hindi Visual Genome
1.1. We automatically translated the English captions to Hausa and manually
post-edited them, taking the associated images into account.

The training set contains 29K segments. A further 1K and 1.6K segments are
provided in the development and test sets, respectively, which follow the same
(random) sampling as the original Hindi Visual Genome.

Additionally, a challenge test set of 1400 segments is available for the
multimodal task. This challenge test set was created in Hindi Visual Genome by
searching for particularly ambiguous English words based on embedding
similarity and manually selecting those where the image helps to resolve the
ambiguity.

Dataset Formats
---------------
The multimodal dataset contains both text and images.

The text parts of the dataset (train and test sets) are in simple tab-delimited
plain text files.

All the text files have seven columns as follows:

Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hausa Text

The image part contains the full images with the corresponding image_id as the
file name. The X, Y, Width, and Height columns indicate the rectangular region
in the image described by the caption.
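
As an illustration of the column layout above, here is a minimal Python sketch
that reads one tab-delimited file and turns the region columns into a crop box.
It rests on assumptions not stated in this README: that the files have no
header row, that X and Y give the top-left corner of the region in pixels, and
that the text is UTF-8. The names Segment, read_segments, and crop_box are
hypothetical helpers, and the sample row is a made-up placeholder, not data
from the corpus.

```python
import csv
import io
from typing import Iterator, NamedTuple, Tuple


class Segment(NamedTuple):
    """One row of a text file, following the seven columns listed above."""
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    hausa: str

    def crop_box(self) -> Tuple[int, int, int, int]:
        # (left, upper, right, lower), assuming X, Y is the top-left corner.
        return (self.x, self.y, self.x + self.width, self.y + self.height)


def read_segments(handle) -> Iterator[Segment]:
    # QUOTE_NONE keeps quotation marks inside captions as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != 7:
            raise ValueError(f"line {line_no}: expected 7 columns, got {len(row)}")
        image_id, x, y, width, height, english, hausa = row
        yield Segment(image_id, int(x), int(y), int(width), int(height),
                      english, hausa)


# Placeholder row for demonstration only (not taken from the dataset).
sample = "123\t10\t20\t30\t40\tenglish caption\thausa caption\n"
segments = list(read_segments(io.StringIO(sample)))
```

For a real file, pass an open file object instead of the in-memory sample, for
example open(path, encoding="utf-8", newline=""), where path is whichever text
file of the release you want to read.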

Data Statistics
---------------
The statistics of the current release are given below.

Parallel Corpus Statistics
--------------------------

Dataset       	Segments	English Words	Hausa Words
----------    	--------	-------------	-----------
Train         	   28930	       143106	     140981
Dev           	     998	         4922	       4857
Test          	    1595	         7853	       7736
Challenge Test	    1400	         8186	       8752
----------    	--------	-------------	-----------
Total         	   32923	       164067	     162326

The word counts are approximate, prior to tokenization.
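
The sketch below shows one plausible way to obtain such untokenized counts:
splitting each caption on whitespace. This README does not say how the figures
were actually computed, so whitespace splitting is an assumption, and
count_words and corpus_counts are hypothetical helper names. The second part
only re-adds the per-split figures from the table to check them against the
Total row.

```python
from typing import Iterable, Tuple


def count_words(text: str) -> int:
    # Assumption: a "word" is any whitespace-separated token.
    return len(text.split())


def corpus_counts(pairs: Iterable[Tuple[str, str]]) -> Tuple[int, int, int]:
    """Return (segments, English words, Hausa words) for (english, hausa) pairs."""
    segments = english = hausa = 0
    for en, ha in pairs:
        segments += 1
        english += count_words(en)
        hausa += count_words(ha)
    return segments, english, hausa


# Per-split figures copied from the table above:
# (segments, English words, Hausa words)
TABLE = {
    "Train": (28930, 143106, 140981),
    "Dev": (998, 4922, 4857),
    "Test": (1595, 7853, 7736),
    "Challenge Test": (1400, 8186, 8752),
}

# Column-wise sums, to compare with the Total row.
totals = tuple(sum(column) for column in zip(*TABLE.values()))
```

Counts produced this way may differ slightly from the table if the original
figures were obtained with a different notion of a word.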

Citation
--------

If you use this corpus, please cite the following paper:

@InProceedings{abdulmumin-EtAl:2022:LREC,
  author    = {Abdulmumin, Idris
          and  Dash, Satya Ranjan
          and  Dawud, Musa Abdullahi
          and  Parida, Shantipriya
          and  Muhammad, Shamsuddeen
          and  Ahmad, Ibrahim Sa'id
          and  Panda, Subhadarshi
          and  Bojar, Ond{\v{r}}ej
          and  Galadanci, Bashir Shehu
          and  Bello, Bello Shehu},
  title     = "{Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation}",
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {6471--6479},
  url       = {https://aclanthology.org/2022.lrec-1.694}
}

Data publisher: Charles University, Faculty of Mathematics and Physics,
Institute of Formal and Applied Linguistics (UFAL)


Acknowledgment:
  Grant 19-26934X (NEUREM3) of the Czech Science Foundation.
  Ministry of Education, Youth and Sports of the Czech Republic, Project No. LM2018101 LINDAT/CLARIAH-CZ.
  Portuguese funding agency FCT - Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020.