HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Name: HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India
License: http://creativecommons.org/licenses/by-nc-sa/4.0/

Bafna, Niyati; Žabokrtský, Zdeněk; España-Bonet, Cristina; van Genabith, Josef; Kumar, Lalit "Samyak Lalit"; Suman, Sharda; Shivay, Rahul

dc.contributor.author	Bafna, Niyati
dc.contributor.author	Žabokrtský, Zdeněk
dc.contributor.author	España-Bonet, Cristina
dc.contributor.author	van Genabith, Josef
dc.contributor.author	Kumar, Lalit "Samyak Lalit"
dc.contributor.author	Suman, Sharda
dc.contributor.author	Shivay, Rahul
dc.date.accessioned	2022-09-16T14:57:43Z
dc.date.available	2022-09-16T14:57:43Z
dc.date.issued	2022-07-14
dc.identifier.uri	http://hdl.handle.net/11234/1-4839
dc.description	HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Languages

This is a collection of folksongs in 26 languages that form a dialect continuum in North India and nearby regions: Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit. The data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/ .

The main characteristics of the languages in this collection are:
- They are all Indic languages except for Korku.
- Most of them are closely related genealogically to standard Hindi (for example Haryanvi and Bhojpuri), although the collection also contains more distant relatives such as Bengali and Gujarati.
- They are all primarily spoken in (North) India; Bengali is also spoken in Bangladesh.
- All except Sanskrit are living languages.

Data

Categorised by the NLP resources that were already available for them, the languages fall into three bands:
* Band 1 languages: Hindi, Panjabi, Gujarati, Bengali, Nepali. These languages already have other large standard datasets available. Kavita Kosh may have very little data for these languages.
* Band 2 languages: Bhojpuri, Magahi, Awadhi, Braj. These languages are attracting growing interest and have some datasets, which are small compared to Band 1 language resources.
* Band 3 languages: All other languages in the collection, which were previously zero-resource languages. These are the languages for which this dataset is the most relevant.

Script

This dataset is entirely in Devanagari. Content in languages not normally written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.

Format

The dataset contains one text file of folksongs per language. Folksongs are separated from each other by an empty line. The first line of each new piece is the title of the folksong, and line separation within folksongs is preserved.
dc.language	Baiga
dc.language	Himachali
dc.language	Khadi Boli
dc.language.iso	hin
dc.language.iso	mar
dc.language.iso	mag
dc.language.iso	awa
dc.language.iso	bho
dc.language.iso	bra
dc.language.iso	bgc
dc.language.iso	raj
dc.language.iso	kfq
dc.language.iso	gbm
dc.language.iso	hne
dc.language.iso	bhb
dc.language.iso	san
dc.language.iso	anp
dc.language.iso	bns
dc.language.iso	kfy
dc.language.iso	bhd
dc.language.iso	ben
dc.language.iso	guj
dc.language.iso	pan
dc.language.iso	noe
dc.language.iso	bjj
dc.language.iso	mup
dc.language.iso	mis
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.publisher	Kavita Kosh Project
dc.relation.replaces	http://hdl.handle.net/11234/1-4787
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri	https://github.com/niyatibafna/north-indian-dialect-modelling
dc.subject	dialect continuum
dc.subject	dialect variation
dc.subject	Indic
dc.subject	Indo-Aryan
dc.subject	Indian
dc.subject	Hindi
dc.title	HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
contact.person	Niyati Bafna niyatibafna13@gmail.com Universität des Saarlandes
size.info	356037 tokens
files.size	1033077
files.count	1
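
The Format paragraph of the description above (folksongs separated by an empty line, title on the first line, line breaks preserved) can be read with a few lines of Python. This is a minimal sketch and not the authors' tooling: the function name `parse_folksongs` and the sample text are assumptions made for illustration, and the files are assumed to be UTF-8 encoded.

```python
from typing import List, Tuple


def parse_folksongs(text: str) -> List[Tuple[str, List[str]]]:
    """Split corpus text into (title, body_lines) pairs.

    Assumes the layout described in the record: songs are separated by
    empty lines and the first line of each song is its title.
    """
    songs: List[Tuple[str, List[str]]] = []
    block: List[str] = []
    for raw in text.splitlines():
        if raw.strip():
            # Non-empty line: part of the current song.
            block.append(raw.rstrip())
        elif block:
            # Empty line closes the current song.
            songs.append((block[0], block[1:]))
            block = []
    if block:
        # Last song when the file does not end with an empty line.
        songs.append((block[0], block[1:]))
    return songs


# Hypothetical sample in the described layout (not real corpus content).
sample = "Title A\nline 1\nline 2\n\nTitle B\nline 3\n"
songs = parse_folksongs(sample)
```

For a real file, the text could be loaded with `open("awadhi-awa.txt", encoding="utf-8").read()` before being passed to `parse_folksongs`; the UTF-8 encoding is an assumption, as the record does not state it.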

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: HinDialect 1.1.zip
Size: 1008.86 KB
Format: application/zip
Description: Zip archive
MD5: 2952598ed6eec55bd82e75a8fefe443d
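
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The sketch below uses only Python's standard `hashlib`; the helper names `md5_of_file` and `is_intact` are illustrative, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib

# Checksum as listed in this record for "HinDialect 1.1.zip".
EXPECTED_MD5 = "2952598ed6eec55bd82e75a8fefe443d"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `is_intact("HinDialect 1.1.zip")` in the download directory should return `True` for an uncorrupted copy. MD5 is suitable here only as a transfer-integrity check, not as a security guarantee.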

File Preview

HinDialect 1.1
- korku-kfq.txt (213 kB)
- sanskrit-san.txt (3 kB)
- panjabi-pan.txt (843 kB)
- baiga-mis.txt (168 kB)
- marathi-mar.txt (21 kB)
- himachali-mis.txt (5 kB)
- braj-bra.txt (116 kB)
- nimadi-noe.txt (183 kB)
- kumaoni-kfy.txt (13 kB)
- hindi-hin.txt (1 kB)
- awadhi-awa.txt (65 kB)
- haryanvi-bgc.txt (616 kB)
- rajasthani-raj.txt (96 kB)
- gujarati-guj.txt (22 kB)
- bhojpuri-bho.txt (257 kB)
- garhwali-gbm.txt (413 kB)
- bhadrawahi-bhd.txt (12 kB)
- magahi-mag.txt (462 kB)
- khadi_boli-mis.txt (56 kB)
- angika-anp.txt (274 kB)
- chhattisgarhi-hne.txt (374 kB)
- bundeli-bns.txt (352 kB)
- bengali-ben.txt (11 kB)
- kanauji-bjj.txt (4 kB)
- bhili-bhb.txt (339 kB)
- malvi-mup.txt (129 kB)
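
The file names in the listing appear to follow a `<language>-<code>.txt` pattern, where the code matches the `dc.language.iso` values in the record and `mis` (the ISO 639-3 code for uncoded languages) is used for Baiga, Himachali and Khadi Boli. Assuming that pattern holds for every file, a name can be split as sketched below; `split_corpus_filename` is a hypothetical helper, not part of the dataset.

```python
from typing import Tuple


def split_corpus_filename(filename: str) -> Tuple[str, str]:
    """Split e.g. 'khadi_boli-mis.txt' into ('khadi boli', 'mis').

    Assumes the '<language>-<code>.txt' pattern seen in the file listing.
    """
    stem = filename.rsplit(".", 1)[0]   # drop the '.txt' extension
    name, code = stem.rsplit("-", 1)    # the code follows the last hyphen
    return name.replace("_", " "), code


language, iso_code = split_corpus_filename("khadi_boli-mis.txt")
```

Because three files share the code `mis`, the language name rather than the code is the safer key when building a per-language lookup.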
