dc.contributor.author Zemánek, Petr
dc.contributor.author Pospíšil, Adam
dc.contributor.author Sellat, Hashem
dc.contributor.author Krubiński, Mateusz
dc.contributor.author Pecina, Pavel
dc.date.accessioned 2024-06-10T08:56:37Z
dc.date.available 2024-06-10T08:56:37Z
dc.date.issued 2023
dc.identifier.uri http://hdl.handle.net/11234/1-5518
dc.description The corpus contains recordings by native speakers of North Levantine Arabic (apc), acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. There were 13 speakers in total: 9 male and 4 female; one aged 15-20, seven aged 20-30, four aged 30-40, and one aged 40-50. The recordings contain both monologues and dialogues on topics of everyday life (health, education, family life, sports, culture), as well as information on both the host countries (living abroad) and the country of origin (Syrian traditions, education system, etc.). Both types are spontaneous: the participants were given only a general subject and either talked about it or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect. The textual data is split between parallel transcriptions (.apc) and translations (.eng), with one segment per line. An additional .yaml file maps each segment to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and to a unique speaker ID. The audio data is shared in 48 kHz .wav format, with dialogues and monologues in separate folders. All recordings are mono (single channel). For dialogues, there is a separate file for each speaker, e.g., "Tar_13052022_Czechia-01.wav" and "Tar_13052022_Czechia-02.wav". The data in this repository corresponds to the validation split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
dc.language.iso apc
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/870930
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject speech corpus
dc.subject speech recognition
dc.subject speech-to-text translation
dc.subject machine translation
dc.subject multilingual
dc.subject Arabic
dc.subject Arabic Corpus
dc.subject north levantine
dc.title UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Pavel Pecina; pecina@ufal.mff.cuni.cz; Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor European Union; EC/H2020/870930; WELCOME - Multiple Intelligent Conversation Agent Services for Reception, Management and Integration of Third Country Nationals in the EU; euFunds; info:eu-repo/grantAgreement/EC/H2020/870930
size.info 152 minutes
files.size 875274515
files.count 4
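
The layout described in dc.description (parallel .apc and .eng files with one segment per line, plus offsets and durations in seconds for 48 kHz mono audio) could be consumed roughly as follows. This is a minimal sketch, not the corpus authors' tooling: the function names are hypothetical, the standard-library `wave` module is assumed to be able to read the files (it handles uncompressed PCM only), and parsing of the .yaml mapping is left out because its field names are not stated in this record.

```python
import wave
from pathlib import Path

SAMPLE_RATE_HZ = 48_000  # sampling rate stated in the record description


def read_parallel(apc_path, eng_path):
    """Pair each transcription line with the translation on the same line."""
    apc = Path(apc_path).read_text(encoding="utf-8").splitlines()
    eng = Path(eng_path).read_text(encoding="utf-8").splitlines()
    if len(apc) != len(eng):
        raise ValueError(f"line count mismatch: {len(apc)} vs {len(eng)}")
    return list(zip(apc, eng))


def segment_frames(offset_s, duration_s, sample_rate=SAMPLE_RATE_HZ):
    """Convert an offset and duration in seconds to (start_frame, frame_count).

    Values may be floats or "%S.%03d"-style strings such as "1.500".
    """
    start = round(float(offset_s) * sample_rate)
    count = round(float(duration_s) * sample_rate)
    return start, count


def read_segment(wav_path, offset_s, duration_s):
    """Return the raw PCM bytes of one segment (assumes an uncompressed WAV)."""
    with wave.open(str(wav_path), "rb") as wav:
        # Use the rate stored in the file header rather than trusting the constant.
        start, count = segment_frames(offset_s, duration_s, wav.getframerate())
        wav.setpos(start)
        return wav.readframes(count)
```

For example, an offset of "1.500" and a duration of "0.250" at 48 kHz correspond to a start frame of 72000 and a length of 12000 frames.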


Files in this item

Total download size: 834.73 MB

Name: dev2024.eng
Size: 84.08 KB
Format: Unknown
Description: Speech translation to eng
MD5: c726d877a4abb430323609492df6574c

Name: dev2024.yaml
Size: 110.33 KB
Format: Unknown
Description: Audio-to-text segment mapping
MD5: 5bb2603cc25e326a87e3f586dc7aa01a

Name: dev2024.apc
Size: 113.6 KB
Format: Unknown
Description: Speech transcription to apc
MD5: 6cc17f60993f87235dc0383d377bced4

Name: dev2024_wav.zip
Size: 834.43 MB
Format: application/zip
Description: Audio files
MD5: 63e4279f57982ee845d31f61d840e5ac

File preview (dev2024_wav.zip)
  • Audio-Dialogues
    • Tar_13052022_Czechia-01.wav (177 MB)
    • Tar_13052022_Food-01.wav (116 MB)
    • Tar_13052022_Work-02.wav (74 MB)
    • Tar_13052022_Czechia-02.wav (177 MB)
    • Tar_13052022_Food-02.wav (116 MB)
    • Tar_13052022_Work-01.wav (74 MB)
  • Audio-Monologues
    • Lat_30122020.wav (7 MB)
    • Dam_06052022_3.wav (5 MB)
    • Dam_01.wav (27 MB)
    • Lat_2932021_5.wav (19 MB)
    • Dam_06052022_2.wav (15 MB)
    • Lat_26052021.wav (77 MB)
    • Lat_2932021_4.wav (16 MB)
    • Lat_2932021_3.wav (6 MB)
    • Lat_2932021_2.wav (15 MB)
    • Alep_27052021.wav (7 MB)
    • Lat_2932021_1.wav (37 MB)
    • Alep_23122020_4.wav (19 MB)
    • Alep_23122020_3.wav (35 MB)
    • Alep_09052021_2.wav (3 MB)
    • Alep_23122020_2.wav (23 MB)
    • Alep_09052021_1.wav (9 MB)
    • Alep_23122020_1.wav (16 MB)
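
The MD5 checksums listed for the four files of this item can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library; the helper names and the assumption that all four files sit in one local directory are illustrative and not part of the record.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "dev2024.eng": "c726d877a4abb430323609492df6574c",
    "dev2024.yaml": "5bb2603cc25e326a87e3f586dc7aa01a",
    "dev2024.apc": "6cc17f60993f87235dc0383d377bced4",
    "dev2024_wav.zip": "63e4279f57982ee845d31f61d840e5ac",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so the large zip archive is never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if present with a matching MD5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify_downloads("path/to/download")` returns a dictionary such as `{"dev2024.eng": True, ...}`; a `False` entry means the file is missing or its checksum differs.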
