A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
The package contains the Czech recordings of the Visual History Archive, which consists of interviews with Holocaust survivors. The archive comprises audio recordings, four types of automatic transcripts, manual annotations of selected topics and interview metadata. In total, the archive contains 353 recordings and 592 hours of interviews.
The data set includes training, development and test data from the shared tasks on pronoun-focused machine translation and cross-lingual pronoun prediction from the EMNLP 2015 workshop on Discourse in Machine Translation (DiscoMT2015). The release also contains the submissions to the pronoun-focused machine translation task, along with the manual annotations used for the official evaluation and gold-standard annotations of pronoun coreference for the shared task test set.
DOESTE v0.5 is a set of developmental corpora of texts written by Brazilian and Portuguese school-age children and adolescents. It is a work in progress.
The texts written by monolingual children and adolescents in European Portuguese were collected between September 2011 and January 2012 from different public schools in Lisbon (Portugal). This subcorpus is composed of 244 texts, narrative (n=122) and argumentative (n=122). The subjects (51% female and 49% male) are students enrolled in the 5th grade (n=52; mean age=10.19), in the 7th grade (n=92; mean age=12.33) and in the 10th grade (n=100; mean age=15.16) of Portuguese basic schooling. The subcorpus of Portuguese texts is fully tokenized and morphologically annotated, and it also presents the sentence occurrences.
The texts written by monolingual children and adolescents in Brazilian Portuguese have been collected since 2017 from different public schools in three cities in Rio Grande do Norte (Brazil). This subcorpus is currently composed of narrative (n=225) and argumentative (n=225) texts. The subjects (53% female and 47% male) are students enrolled in the 5th grade (n=68; mean age=11.13), in the 9th grade (n=82; mean age=15.32) and in the 12th grade (n=224; mean age=17.96) of Brazilian basic schooling. The subcorpus of Brazilian texts is still being compiled, but a large part is already searchable, tokenized and morphologically annotated. The Brazilian subcorpus also includes the original transcripts, along with the original images.
Portuguese and Brazilian texts were collected from similar tasks:
Narrative-based task: Tell a remarkable story (real or imagined) that you and your best friend lived during the last school vacation.
Argumentative-based task: Do you think social networks (Facebook, Twitter, Google+, Windows Live Space, etc.) are important today? Write a text to be published on your school's blog where you express your opinion on social networks. In this text, you must say whether you are for or against the existence of social networks. Don't forget to justify your opinion!
The next version of DOESTE is intended to include semantic annotations and clause and t-unit segmentation.
DOESTE v0.5 is developed and maintained by the Educational Linguistics Research Group (LEd), based at the Federal Rural University of the Semiarid Region (UFERSA).
DOESTE v0.5 by Mário Martins et al. is licensed under CC BY-NC-ND 4.0.
In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages, or present a low quality derived from sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 Pb of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
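To illustrate the general idea of paragraph-level deduplication that keeps provenance, here is a minimal sketch. It is not the esCorpius pipeline: it assumes exact matching on a normalized hash, newline-separated paragraphs, and hypothetical record keys ("url", "warc", "text") chosen only for this example.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def dedup_paragraphs(documents):
    """Drop paragraphs already seen in earlier text, keeping first occurrences.

    `documents` is an iterable of dicts with the hypothetical keys
    "url", "warc" and "text"; paragraphs are assumed to be newline-separated.
    Paragraphs are removed whole, so paragraph boundaries stay intact, and
    each surviving document keeps its source URL and WARC origin.
    """
    seen = set()
    for doc in documents:
        kept = []
        for para in doc["text"].split("\n"):
            key = normalize(para)
            if not key:
                continue  # skip empty lines
            digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact (normalized) duplicate of an earlier paragraph
            seen.add(digest)
            kept.append(para.strip())
        if kept:  # documents left with no unique paragraphs are dropped
            yield {"url": doc["url"], "warc": doc["warc"], "text": "\n".join(kept)}
```

A real web-scale pipeline would likely need near-duplicate detection and a distributed hash store; the in-memory set here is only suitable for small samples.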
"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets, annotated with syntactic dependency relations, part-of-speech tags and sentiment polarity, together with automatic translations of the original Persian sentences into five different languages (EN, CS, DE, IT, HI).
ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual natural language processing (NLP) technologies face -- translation of coreference across languages. Our corpus contains parallel texts for the language pair English-German, two major European languages. Despite being typologically very close, these languages still have systemic differences in the realisation of coreference, and thus pose problems for multilingual coreference resolution and machine translation. Our parallel corpus covers the genres of planned speech (public lectures) and newswire. It is richly annotated for coreference in both languages, including annotation of both nominal coreference and reference to antecedents expressed as clauses, sentences and verb phrases. This resource supports research in the areas of natural language processing, contrastive linguistics and translation studies on the mechanisms involved in coreference translation in order to develop a better understanding of the phenomenon.