An XML-based file containing the electronic version of the al wassit dictionary, an Arabic monolingual dictionary compiled by the Academy of the Arabic Language in Cairo.
An LMF-conformant XML-based file containing the electronic version of the al wassit dictionary, an Arabic monolingual dictionary compiled by the Academy of the Arabic Language in Cairo.
A special edition of the Elektajournal from 1933 dedicated to the fifteenth anniversary of the Czechoslovak Republic. The introduction consists of a short retrospective composed of archival materials obtained in the streets of Prague on 28 October 1918. Footage of the parade on Wenceslaus Square on 28 October 1933. Raising the Czechoslovak flag in front of the Municipal House on Republic Square. Philips Radio vehicle with loudspeakers on the roof. Philips microphone. President Tomáš Garrigue Masaryk and Minister of National Defence Bohumír Bradáč during a military parade on Wenceslaus Square. The ceremonial parade heads down 28 October Street and National Avenue to Smetana Embankment. MPs František Soukup, František Staněk and Prime Minister Jan Malypetr watch the ceremony from a grandstand by the Rudolfinum. Shots of a Czechoslovak Army military parade; the parade consists of infantry, soldiers with bicycles, horse-drawn cannons and armoured cars. Flyover by military aircraft. Parade of scouts in front of President Masaryk in the third courtyard of Prague Castle.
The segment from a Degl film production company newsreel captures the celebrations of the 50th anniversary of the laying of the cornerstone of the National Theatre. The celebrations took place on 16-18 May 1918 in Prague, with the participation of representatives of all Slavic nations of the Austro-Hungarian Empire. The first shots show the festively decorated building of the National Theatre. In the next part, the camera observes the events taking place in the upper part of Wenceslas Square. The staircase and the ramp of the National Museum, where the opening ceremony took place (specifically in its Pantheon), are filled with young people in national folk costumes. Shots of the crowded square. Cultural and political figures, such as poets Adolf Heyduk and Pavol Országh Hviezdoslav, writer Alois Jirásek and the head of the National Theatre Opera Karel Kovařovic, are leaving the building of the National Museum. This is followed by the symbolic ceremonial removal of politician Karel Kramář from the building. Afterwards, the Slovenian writer and mayor of Ljubljana Ivan Tavčar is seen leaving the building, as well as Czech actors Eduard Vojan, Marie Hübnerová, Leopolda Dostalová, Marie Laudová-Hořicová, Karel Želenský, writers Ignát Herrmann, František Herites and Jan Herben with his wife Bronislava, poet Bohdan Kaminský, politicians Alois Rašín, František Soukup, Gustav Habrman, Václav Klofáč and other notable national figures.
The segment captures the celebration of the fifth anniversary of the Czechoslovak Republic held in Prague on 28 October 1923. Festivities by the Statue of St Wenceslaus on Wenceslaus Square. Karel Kramář stands at the rostrum. Crowds gathered on the square wave their hats.
Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 17, captures the presentation of a gift, Ambulance Train no. 751, from the Protectorate of Bohemia and Moravia to Adolf Hitler and the German army. The train handover took place at Prague Main Railway Station on 20 April 1942, the birthday of Adolf Hitler. Cars arrive in front of Prague Main Railway Station. Acting Reich Protector Reinhard Heydrich enters the train station. State President Emil Hácha gives a speech in the festively decorated railway hall. In response, Heydrich shakes his hand. The event is witnessed by a delegation of railway workers. The train crew lines up on the station platform. Heydrich enters the train with his entourage and inspects the sleeping cars, the operating carriage, the kitchen, and the sick bay. The inspection of the ambulance train is attended by Protectorate Prime Minister Jaroslav Krejčí and Minister of Education and People's Enlightenment Emanuel Moravec. According to the voiceover, the train was made in a railway workshop in Prague-Bubny in record time. It consisted of 28 carriages and 20 hospital carriages, was 410 metres long, weighed 545 tons and had capacity for 280 wounded.
A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word-aligned sentence pairs from the English and Swedish versions of Europarl v. 2.
This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) 1942, issue no. 27A, captures the Pledge of Czech Theatre Professionals' Allegiance to the Reich, a manifestation held at the National Theatre in Prague on 25 June 1942, which was to unequivocally condemn the assassination of Acting Reich Protector Reinhard Heydrich. Speeches are delivered by actor Rudolf Deyl Jr. and Minister of Education and People's Enlightenment Emanuel Moravec (silent). Actress Růžena Nasková and actors Karel Höger, Ferenc Futurista, and Stanislav Neumann are seen among the participants. The segment concludes with everyone performing the Nazi salute.
A morphological layer for the German part of the SMULTRON corpus. The layer was annotated according to the STTS tagset and the annotation guidelines of the Tiger corpus.
Coordinator: Thomas Müller
Annotators: Francesca Caratti, Arne Recknagel
This distribution contains a morphological layer for the SMULTRON corpus [0].
The annotation process is described in:
@InProceedings{mueller2015,
author = {M\"uller, Thomas and Sch\"utze, Hinrich},
title = {Robust Morphological Tagging with Word Representations},
booktitle = {Proceedings of NAACL},
year = {2015},
}
[0] http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks/smultron_en.html
This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate).
The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied.
The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web-pages. The corpus is intended for evaluation of automatic speech recognition (ASR) systems, especially in conditions where the prior availability of in-domain vocabulary and named entities is beneficial.
The corpus consists of 39 presentations in English, each up to 90 seconds long, and slides and web-pages in Czech, Slovak, English, German, Romanian, Italian or Spanish.
The speakers are high school students from European countries with English as their second language.
We benchmark three baseline ASR systems on the corpus and show their shortcomings.
Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1941, issue no. 40, captures events linked to the accession of SS-Obergruppenführer Reinhard Heydrich to the office of Deputy Reich Protector of the Protectorate of Bohemia and Moravia on 27 September 1941. Heydrich attends an SS military parade on Hradčanské Square in Prague. Military dignitaries and state officials welcome him in the first quadrangle of Prague Castle. The Nazi flag flies over Prague Castle. Reich Commissioner for the Sudetenland Konrad Henlein and Reich Secretary Karl Hermann Frank are present at the occasion. State President Emil Hácha receives Reinhard Heydrich at Prague Castle.
An image of actor Ada Karlovský in an unidentified film. Karlovský with his colleague Terezie Javůrková in Pražští adamité (The Prague Adamites, dir. Antonín Fencl, 1917).
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved. This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic).
This XML file describes the Arabic phonetic constraints to be applied to Arabic roots. The first rule category lists the letters that may not occur in the same root, regardless of their order. The second category lists the letters that may not be used together in a root word in a specific order. The third and fourth categories state that contiguous letters must not be redundant.
ISLRN: 991-445-325-823-5
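The rule categories described for the phonetic constraints file can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Latin placeholder letters and the two sample rules are invented and not taken from the XML file, the function name is hypothetical, "a specific order" is read as "anywhere earlier in the root", and "redundant" is read as "identical adjacent letters".

```python
from itertools import combinations

# Placeholder rules for illustration only; the real ones are defined in the XML file.
UNORDERED_FORBIDDEN = {frozenset(("a", "b"))}  # category 1: a and b never share a root
ORDERED_FORBIDDEN = {("c", "d")}               # category 2: c may not come before d


def root_violations(root):
    """Return a list of (rule, first_letter, second_letter) violations for a root."""
    found = []
    # Categories 1 and 2: inspect every pair of letters, keeping their order.
    for first, second in combinations(root, 2):
        if frozenset((first, second)) in UNORDERED_FORBIDDEN:
            found.append(("unordered", first, second))
        if (first, second) in ORDERED_FORBIDDEN:
            found.append(("ordered", first, second))
    # Categories 3 and 4 (as read here): adjacent letters must differ.
    for first, second in zip(root, root[1:]):
        if first == second:
            found.append(("repeated", first, second))
    return found
```

With these placeholder rules, `root_violations("bax")` reports an unordered violation, while `root_violations("dxc")` reports none because the ordered rule only forbids c before d.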
The lexical network AdjDeriNet consists of pairs of base adjectives and their derivatives. It contains nearly 18 thousand base adjectives that are base words for more than 26 thousand lexemes of several parts of speech.
Painter Adolf Hoffmeister on Bohumil Veselý's balcony. Hoffmeister in a fragmented segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 26.
Phonological networks are representations of word forms and their phonological relationships with other words in a given language lexicon. A principle underlying the growth (or evolution) of those networks is preferential attachment, or the ‘rich-gets-richer’ mechanism, according to which words with many phonological neighbors (or links) are the main beneficiaries of future growth opportunities. Due to their limited number of words, language lexica constitute node-constrained networks where growth cannot keep increasing in a linear way; hence, preferential attachment is likely mitigated by certain factors. The present study investigated aging effects (i.e., a word’s finite time span of being active in terms of growth) in an evolving phonological network of English as a second language. It was found that phonological neighborhoods are constructed by one large initial lexical spurt, followed by sublinear growth spurts that eventually lead to very limited growth in later lexical spurts during network evolution, all the while obeying the law of preferential attachment. An analysis of the strength of phonological relationships between phonological word forms revealed a tendency to attach more distant phonological neighbors in the lower proficiency levels, while phonologically more similar neighbors enter phonological neighborhoods at more advanced levels of English as a second language. Overall, the findings suggest an aging effect in growth that favors younger words. In addition, beginning learners seem to prefer the acquisition of phonological neighbors that are easier to discriminate. Implications for the second language lexicon include leveraged learning mechanisms, learning bouts focussed on a smaller range of phonological segments, and questions concerning lexical processing in aging networks.
The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours) but we plan to search for additional data next year. The audio data format is 8 kHz, 16-bit PCM, mono. Supported by the Technology Agency of the Czech Republic, project No. TA01030476.
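The audio format stated for the air traffic control corpus (8 kHz, 16-bit PCM, mono) can be checked programmatically. The sketch below uses Python's standard `wave` module and assumes the recordings are stored in WAV containers, which the description does not state; the function names are hypothetical, and the in-memory silent file exists only so the check can be tried without corpus data.

```python
import io
import wave


def has_corpus_format(source):
    """Check a WAV file (path or binary file object) for 8 kHz, 16-bit, mono audio."""
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == 8000   # 8 kHz sampling rate
            and wav.getsampwidth() == 2  # 16 bits = 2 bytes per sample
            and wav.getnchannels() == 1  # mono
        )


def make_silence(rate=8000, width=2, channels=1, frames=80):
    """Build a short silent WAV in memory, used here only to try the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (width * channels * frames))
    buffer.seek(0)
    return buffer
```

Calling `has_corpus_format(make_silence())` returns `True`, while a 16 kHz or stereo variant returns `False`.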
Corpus AKCES 2 consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Supported by MŠMT (MSM0021620825), UK (PRVOUK P 10).
Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Supported by UK, PRVOUK P10.
Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora). Supported by ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10).
Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora). Supported by ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10).
Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text.
In addition to a few minor bugs, fixes a critical issue in Release 1: the native speakers of Ukrainian (s_L1:"uk") were wrongly labelled as speakers of "other European languages" (s_L1_group="IE"), instead of speakers of a Slavic language (s_L1_group="S"). The file is now a regular XML document, with all annotation represented as XML attributes.
AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format.
Note that, in comparison to the CZESL-GEC dataset, this dataset contains separate edits together with their type annotations in M2 format and has twice as many sentences.
If you use this dataset, please use the following citation:
@article{naplava2019wnut,
title={Grammatical Error Correction in Low-Resource Scenarios},
author={N{\'a}plava, Jakub and Straka, Milan},
journal={arXiv preprint arXiv:1910.00353},
year={2019}
}
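The M2 format mentioned for AKCES-GEC stores each sentence as an `S` line of tokens followed by `A` lines, one per edit, with fields separated by `|||`. A minimal Python sketch of reading such data follows. The sample sentence and the error type `SVA` are invented for illustration and are not taken from the corpus, and the function names are hypothetical.

```python
def parse_m2(text):
    """Parse M2 text into a list of (tokens, edits) pairs, one per sentence."""
    sentences = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        tokens = lines[0][2:].split()  # drop the leading "S "
        edits = []
        for line in lines[1:]:
            # A line: "A start end|||type|||correction|||required|||comment|||annotator"
            span, etype, correction, _, _, annotator = line[2:].split("|||")
            start, end = map(int, span.split())
            edits.append((start, end, etype, correction, int(annotator)))
        sentences.append((tokens, edits))
    return sentences


def apply_edits(tokens, edits, annotator=0):
    """Apply one annotator's edits right to left so earlier offsets stay valid."""
    corrected = list(tokens)
    for start, end, _, correction, who in sorted(edits, reverse=True):
        if who != annotator or start < 0:  # skip other annotators and "noop" edits
            continue
        corrected[start:end] = correction.split()
    return corrected


# Invented example, not from AKCES-GEC.
SAMPLE = """S This are a sentence .
A 1 2|||SVA|||is|||REQUIRED|||-NONE-|||0
"""
```

For the sample, `parse_m2(SAMPLE)` yields one sentence with a single edit replacing token 1, and `apply_edits` turns the tokens into `This is a sentence .`.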
AlbMoRe is a sentiment analysis corpus of movie reviews in Albanian, consisting of 800 records in CSV format. Each record includes a text review retrieved from IMDb and translated into Albanian by the author. It also contains a 0 (negative) or 1 (positive) label added by the author. The corpus is fully balanced, consisting of 400 positive and 400 negative reviews about 67 movies of different genres. The AlbMoRe corpus is released under a CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper: Çano Erion. AlbMoRe: A Corpus of Movie Reviews for Sentiment Analysis in Albanian. CoRR, abs/2306.08526, 2023. URL https://arxiv.org/abs/2306.08526.
AlbNER is a Named Entity Recognition corpus of Wikipedia sentences in Albanian, consisting of 900 records. The sentence tokens are manually labeled following the CoNLL-2003 shared task annotation scheme explained at https://aclanthology.org/W03-0419.pdf, which uses the I-ORG, B-ORG, I-PER, B-PER, I-LOC, B-LOC, I-MISC, B-MISC and O tags. AlbNER data are released under a CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using the AlbNER corpus, please cite the following paper: Çano Erion. AlbNER: A Corpus for Named Entity Recognition in Albanian. CoRR, abs/2309.08741, 2023. URL https://arxiv.org/abs/2309.08741.
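The tag set listed for AlbNER can be turned into entity spans with a short decoder. The Python sketch below works on plain token and tag lists, since the description does not specify the file layout; the example sentence is invented and not taken from the corpus, and the function name is hypothetical. It starts a new entity at every `B-` tag and also at an `I-` tag whose type differs from the preceding tag, so it tolerates both common conventions for when `B-` is used.

```python
def entity_spans(tokens, tags):
    """Group tokens into (entity_type, text) spans from B-/I-/O tags."""
    spans = []
    current = None  # (entity_type, [tokens]) of the entity being built
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current = None
            continue
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or current is None or current[0] != etype:
            current = (etype, [token])  # a new entity begins here
            spans.append(current)
        else:
            current[1].append(token)    # continue the open entity
    return [(etype, " ".join(words)) for etype, words in spans]


# Invented example, not from AlbNER.
TOKENS = ["Ismail", "Kadare", "lindi", "në", "Gjirokastër", "."]
TAGS = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
```

Here `entity_spans(TOKENS, TAGS)` groups "Ismail Kadare" as a PER entity and "Gjirokastër" as a LOC entity.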
AlbNews is a topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample includes a headline text retrieved from Albanian online news portals. It also contains one of four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. Each of the unlabeled samples contains a headline text only. The AlbNews corpus is released under a CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper:
Çano Erion, Lamaj Dario. AlbNews: A Corpus of Headlines for Topic Modeling in Albanian. CoRR, abs/2402.04028, 2024. URL: https://arxiv.org/abs/2402.04028.
A dataset intended for fully trainable natural language generation (NLG) systems in task-oriented spoken dialogue systems (SDS), covering the English public transport information domain. It includes preceding context (user utterance) along with each data instance (pair of source meaning representation and target natural language paraphrase to be generated).
Taking the form of the previous user utterance into account for generating the system response allows NLG systems trained on this dataset to entrain (adapt) to the preceding utterance, i.e., reuse wording and syntactic structure. This should presumably improve the perceived naturalness of the output, and may even lead to a higher task success rate.
Crowdsourcing has been used to obtain natural context user utterances as well as natural system responses to be generated.
Painter Alfons Mucha at Zbiroh Chateau working on The Battle of Grünwald from his Slav Epic cycle. Mucha in his studio working on a design for the windows of St. Vitus Cathedral. Mucha in the garden of his villa in Prague-Bubeneč. Mucha with his wife Marie (née Chytilová), son Jiří, and daughter Jaroslava. Mucha with painters Max Švabinský and Alois Kalvoda.