Czech data - both train and test+eval sets, as well as the valency dictionary - for the CoNLL 2009 Shared Task. Documentation is included. The data are generated from PDT 2.0. LDC catalog number: LDC2009E34B and MSM 0021620838 (http://ufal.mff.cuni.cz:8080/bib/?section=grant&id=116488695895567&mode=view)
Czech trial (example) data for CoNLL 2009 Shared Task. The data are generated from PDT 2.0. LDC2009E32B and MSM 0021620838 (http://ufal.mff.cuni.cz:8080/bib/?section=grant&id=116488695895567&mode=view)
Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. Moreover, each language dataset contains (substantially larger) training set collected from (general) Web texts. All sets, except for Wikipedia and Web training sets that can contain similar sentences, are disjoint. Data are segmented into sentences which are further word tokenized.
All data in the corpus contain diacritics. To strip diacritics from them, use Python script diacritization_stripping.py contained within attached stripping_diacritics.zip. This script has two modes. We generally recommend using method called uninames, which for some languages behaves better.
The code for training recurrent neural-network based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
A slightly modified version of the Czech Wordnet. This is the version used to annotate "The Lexico-Semantic Annotation of PDT using Czech WordNet": http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
The Czech WordNet was developed by the Centre of Natural Language Processing at the Faculty of Informatics, Masaryk University, Czech Republic.
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 23,094 word senses (synsets). 203 of these were created or modified by UFAL during correction of annotations. This version of WordNet was used to annotate word senses in PDT: http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
A more recent version of Czech WordNet is distributed by ELRA: http://catalog.elra.info/product_info.php?products_id=1089 and 1ET201120505, LM2010013
English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the Emille parallel corpus. and FP7-ICT-2007-3-231720 (EuroMatrix Plus) 7E09003 (Czech part of EM+)
HindEnCorp parallel texts (sentence-aligned) come from the following sources:
Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally col- lected for the DARPA-TIDES surprise-language con- test in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).
Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.
EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual sub- corpora, including both written and (for some lan- guages) spoken data for fourteen South Asian lan- guages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.
Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and Agriculture domain parallel corpus.

For the current release, we are extending the parallel corpus using these sources:
Intercorp (Čermák and Rosen,2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominately short sto- ries and novels. There are seven Hindi texts in Inter- corp. Unfortunately, only for three of them the English translation is available; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.
TED talks 3 held in various languages, primarily English, are equipped with transcripts and these are translated into 102 languages. There are 179 talks for which Hindi translation is available.
The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects starting from typesetting and punctuation over capi- talization, spelling, word choice to sentence structure. A little bit of control could be in principle obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.
Launchpad.net is a software collaboration platform that hosts many open-source projects and facilitates also collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entitity that appears on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary. and LM2010013,
A Hindi corpus of texts downloaded mostly from news sites. Contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. 18M sentences, 308M tokens and FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+)
Hindi monolingual corpus. It is based primarily on web crawls performed using various tools and at various times. Since the web is a living data source, we treat these crawls as completely separate sources, despite they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments if we concatenate the individual sources (each source being deduplicated on its own) with the number of segments if we de-duplicate all sources to- gether. The difference is just around 1%, confirming, that various web crawls (or their subsequent processings) differ significantly.
HindMonoCorp contains data from:
Hindi web texts, a monolingual corpus containing mainly Hindi news articles has already been collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010 and we add a small crawl performed in 2013 and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following.
Hindi corpora in W2C have been collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two corpora of Hindi available—one from web harvest (W2C Web) and one from the Wikipedia (W2C Wiki).
SpiderLing is a web crawl carried out during November and December 2013 using SpiderLing (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents, see below.
CommonCrawl is a non-profit organization that regu- larly crawls the web and provides anyone with the data. We are grateful to Christian Buck for extracting plain text Hindi segments from the 2012 and 2013-fall crawls for us.
Intercorp – 7 books with their translations scanned and manually alligned per paragraph
RSS Feeds from Webdunia.com and the Hindi version of BBC International followed by our custom crawler from September 2013 till January 2014. and LM2010013,
This dataset contains annotation of PDT using Czech WordNet ontology: http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation. and 1ET100300517, 1ET201120505
One of the goals of LINDAT/CLARIN Centre for Language Research Infrastructure is to provide technical background to institutions or researchers who wants to share their tools and data used for research in linguistics or related research fields. The digital repository is built on a highly customised DSpace platform. and LM2010013 - FULLY SUPPORTED BY THE MINISTRY OF EDUCATION, SPORTS AND YOUTH OF THE CZECH REPUBLIC
One of the goals of LINDAT/CLARIN Centre for Language Research Infrastructure is to provide technical background to institutions or researchers who wants to share their tools and data used for research in linguistics or related research fields. The digital repository is built on a highly customised DSpace platform. and LM2010013 - FULLY SUPPORTED BY THE MINISTRY OF EDUCATION, SPORTS AND YOUTH OF THE CZECH REPUBLIC
This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0. and grant 1ET201120505 of the Academy of Sciences of the Czech Republic and grant MSM0021620838 of the Ministry of Youth, Education and Sport of The Czech Republic
The ParCzech 3.0 corpus is the third version of ParCzech consisting of stenographic protocols that record the Chamber of Deputies’ meetings held in the 7th term (2013-2017) and the current 8th term (2017-Mar 2021). The protocols are provided in their original HTML format, Parla-CLARIN TEI format, and the format suitable for Automatic Speech Recognition. The corpus is automatically enriched with the morphological, syntactic, and named-entity annotations using the procedures UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files.
The ParCzech PS7 1.0 corpus is the very first member of the corpus family of data coming from the Parliament of the Czech Republic. ParCzech PS7 1.0 consists of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between 2013-2017. The audio recordings are available as well. Transcripts are provided in the original HTML as harvested, and also converted into TEI-derived XML format for use in TEITOK corpus manager. The corpus is automatically enriched with the morphological and named-entity annotations using the procedures MorphoDita and NameTag.
The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between 2013-2017. The protocols are provided in their original HTML format, TEI format and TEI-derived format to make them searchable in the TEITOK corpus manager. Their audio recordings are available as well. The corpus is automatically enriched with the morphological, syntactic, and named-entity annotations using the procedures UDPipe 2 and NameTag 2.
A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C in short in the sequel) is a consolidated release of the existing PDT-corpora of Czech data, uniformly annotated using the standard PDT scheme. PDT-corpora included in PDT-C: Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); Czech part of Prague Czech-English Dependency Treebank (translated financial texts, from English), Prague Dependency Treebank of Spoken Czech (spoken data, including audio and transcripts and multiple speech reconstruction annotation); PDT-Faust (user-generated texts). The difference from the separately published original treebanks can be briefly described as follows: it is published in one package, to allow easier data handling for all the datasets; the data is enhanced with a manual linguistic annotation at the morphological layer and new version of morphological dictionary is enclosed; a common valency lexicon for all four original parts is enclosed. Documentation provides two browsing and editing desktop tools (TrEd and MEd) and the corpus is also available online for searching using PML-TQ.
The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see Documentation). Moreover, new information was added to the data:
Annotation of multiword expressions
Pair/group meaning
Clause segmentation and Ministry of Education of the Czech Republic projects No.:
LM2010013
LC536
MSM0021620838
Grant Agency of the Czech Republic grants No.:
P406/2010/0875
P202/10/1333
P406/10/P193
The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, new structure of basic documentation and new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (832,823 words) on all layers, from tectogrammatical annotation to syntax to morphology. There are additional annotated sentences for syntax and morphology; the totals for the lower layers of annotation are: 87,913 sentences with 1,502,976 words at the analytical layer (surface dependency syntax) and 115,844 sentences with 1,956,693 words at the morphological layer of annotation (these totals include the annotation with the higher layers annotated as well). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.
ROMi represents a specific subcorpus of CZESL (Czech as a Second Language). It collects examples of language use, both spoken and written, of Czech Romani children and teen-agers. The range of materials exceeds 1,5 million words.
Language Material
The material presents uses of spoken language by language-specific group of Romani speakers using Czech as their first language. However, this form of the language is specifically different from Czech as used by the Czech-speaking majority, both on the spoken and secondarily on the written level. It concerns the so-called Romani ethnolect of Czech, i.e. a variety of Czech used by Romani communities mainly in the Czech Republic. We may detect obvious influence of Romani, Slovak and Hungarian. Furthermore, many of the recorded speakers live in social exclusion and thus their language production is influenced by both factors, i.e. by Romani ethnolect and social exclusion.
The language material was collected in the years 2009 – 2012 under the Education for Competitiveness Operational Programme, within the framework of the project Innovations of Czech as a Second Language Education collaboratively by the Technical University of Liberec and the Institute of Czech Language and Theory of Communication, Faculty of Arts, Charles University. The language material was processed with support of Institute of Formal and Applied Linguistics - project LINDAT-Clarin.
It concerns 110 recordings obtained in various environments – the collection of material took place both in schools and also in several non-profit organizations offering leisure time activities to Romani students. Apart from the school setting, the recordings thus come from the environment of extracurricular activities, sport matches and households. Both the respondents and the collectors are Romani. The samples were acquired in all regions of the Czech Republic, although the majority of recordings were obtained in the Central Bohemia, South Bohemia, Ústí and Vysočina Region. The age of the respondents ranges from 12 to 28 years. The collected samples are also accompanied by metadata relating to the following areas:
The collected samples are accompanied by metadata relating to the following areas:
• The place of origin (the place of collection, the size of the residence and dialect area, region, environment (school, extracurricular, private); socially excluded locality.
• The circumstances of the collection expressing the extent of control exercised by the collector (topic assigned/non-assigned).
• The respondent (the age of the student; class/year; sex; type of the school; subjective knowledge of Romani; first language – the one the student considers to be his first; communicative environment in the family – which language(s) is/are used for communication in the family.
• The place of data collection – in the case of schools metadata comprise characteristics of the type of school (primary, for students with special needs, remedial, vocational, secondary), the founder (state, church, private organisation), in the case of the place of individual collection of data you may find organisation, interest group markings, etc.
• The collector (the abbreviation of collector´s name and his work area, in some cases also his age).
Delimiting the group of respondents
The respondents are constituted by students of primary schools, schools for students with special needs, secondary schools and by teenagers who have just completed the compulsory education. For the purposes of the language material collection, those students who consider themselves to be Romani or who are considered Romani by others were included to the sample. Moreover, a language criterion was added to this definition - thus those students in whose families Romani is spoken at home were also included. Active knowledge of the Romani language was not required since hardly a third of Romani children living in the Czech Republic nowadays is competent in this language.
Ethical aspects of the data collection and processing
As regards the content of the language material, it places demands on the data processing from the ethical point of view. Frequently, the texts and recordings feature highly interesting material; the respondents talk about their life stories fully distant or inconceivable for the social majority. During the transcription process, all materials are anonymized and identification data are removed.
Field Research
When dealing with the environment threatened by social exclusion, it is highly important to consider especially the needs and opportunities of the group members as well as the needs of those individuals, who find themselves or work in such an environment. During the developmental process of the corpus, we became decidedly convinced that it is necessary to accommodate different demands on material quality of texts and recordings and not to overburden both the respondents and the collectors with limiting or impossible requirements. Therefore, the corpus comprises several recordings of lower technical quality which were acquired in the presence of other persons, with the television turned on, etc. Firstly, the recordings would not even have come into existence under different circumstances – it is natural that the interviewing of younger children was taking place directly in their households, in the presence of their parents. Secondly, the recordings would have been made, yet they would have been influenced by the unnaturalness of the situation, consequently affecting the language material. Apart from the interviews with younger children, it regards especially those conversations between the collectros and their peers, e.g. inside leisure time clubs.
Characteristics of the recordings
The collected recordings come both from the school environment (especially conversations of teacher assistants with individual students) and from the leisure time facilities (interest groups, after-school tutoring). In most cases it concerns conversations of the collector and the individual, alternatively a pair of respondents. The length of the recordings differs, although the majority ranges from 20 to 35 minutes. A single recording approximately contains 2 495 words. The quality of recordings is influenced by the limits of field-utilizable technologies and the effort to increase authenticity to the maximum.
Transcription of the recordings
The rules for transcription of the recordings are based on similar ones designed for SCHOLA corpus. Transcriptions are carried out by the means of folkloristic transcription, i.e. the closest to the written record, especially adapted for the purposes of computational processing, following the practice established in the Czech National Corpus. The transcription is performed with the help of the Transcriber programme, which connects the sound and graphic track.