A Hindi corpus of texts downloaded mostly from news sites. It contains both the original raw texts and an extensively cleaned-up, tokenized version suitable for language modeling. Size: 18M sentences, 308M tokens. Funding: FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+).
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds