Search Results
862. Helsinki Finite State Technology
- Publisher:
- University of Helsinki
- Type:
- toolService
- Subject:
- finite state transducer and morphological analyzer
- Description:
- The Helsinki Finite-State Transducer (HFST) software is intended for implementing morphological analysers and other tools based on weighted and unweighted finite-state transducer technology. The feasibility of the HFST toolkit has been demonstrated by full-fledged open-source implementations of Finnish, Swedish, English, French and Northern Sámi lexicons.
- Rights:
- Not specified
863. Helsinki Finite-State Technology
- Publisher:
- University of Helsinki
- Type:
- toolService
- Description:
- The Helsinki Finite-State Transducer (HFST) software is intended for implementing morphological analysers and other tools based on weighted and unweighted finite-state transducer technology. The feasibility of the HFST toolkit has been demonstrated by full-fledged open-source implementations of Finnish, Swedish, English, French and Northern Sámi lexicons.
- Rights:
- Not specified
864. Herders Conversations-Lexikon
- Type:
- lexicalConceptualResource
- Subject:
- Germanistik
- Language:
- German
- Description:
- 1st ed. 1854-1857; cross-disciplinary presentation of the subject areas of social conversation
- Rights:
- Not specified
865. Heřman Šikl (anatomist)
- Creator:
- Veselý, Bohumil
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Šikl Heřman (1888-1955)
- Language:
- No linguistic content
- Description:
- Anatomist Professor Heřman Šikl on Bohumil Veselý's balcony.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
866. Heřman Zeffi (composer)
- Creator:
- Veselý, Bohumil
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- pejsek, Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Zeffi Heřman (1872-1955)
- Language:
- No linguistic content
- Description:
- Composer Heřman Zeffi on Bohumil Veselý's balcony.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
867. HetWiK: Heterogene Widerstandskulturen
- Creator:
- Schuster, Britt-Marie, Markewitz, Friedrich, Wilk, Nicole M., Schröder, Sarah, and Rüdiger, Jan Oliver
- Publisher:
- Universität Paderborn
- Type:
- text and corpus
- Subject:
- corpus, annotated corpus, Widerstand, Widerstandskorpus, Jüngere Sprachgeschichte, Kommunikationsgeschichte, Nationalsozialismus, Sprachliche Praktiken, Soziale Identität, Beziehungskonstitution, Faktizitätsherstellung, Argumentieren, Direktiva, resistance, resistance corpus, recent language history, communication history, National Socialism, linguistic practices, social identity, relationship formation, creating facticity, argumentation, directive speech acts, and speech act
- Language:
- German
- Description:
- The representative HetWiK corpus, digitized in full text, comprises 140 manually annotated texts of the German Resistance written between 1933 and 1945. It includes both well-known and relatively unknown documents: public writings such as pamphlets and memoranda, as well as private texts such as letters, journal or prison entries, and biographies. The corpus thus represents the diverse groups and the heterogeneity of verbal resistance, and allows resistance to be studied in relation to language use. The HetWiK corpus can be used free of charge. A detailed register of the individual texts and further information about the tagset can be found on the project homepage (in German). In addition to the CATMA5 XML format, we provide a standoff JSON format and CEC6 files (CorpusExplorer), so the HetWiK corpus can be exported in different formats.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
868. High-Coverage Multi-Level Text Corpus for Non-Professional Voice Conservation
- Creator:
- Jůzová, Markéta, Tihelka, Daniel, and Matoušek, Jindřich
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- text and corpus
- Subject:
- text-to-speech (TTS), voice conservation, voice banking, and text corpus
- Language:
- Czech
- Description:
- This text corpus contains a carefully optimized set of sentences that can be used when preparing a speech corpus for the development of a personalized text-to-speech system. It was designed primarily for the voice conservation procedure, which must be performed in a relatively short period before a person loses his/her own voice, typically because of a total laryngectomy. Total laryngectomy is a radical treatment procedure that is often unavoidable to save the lives of patients diagnosed with severe laryngeal cancer. Although it is very effective as a primary treatment, it significantly handicaps patients through the permanent loss of their ability to use their voice and produce speech. Fortunately, modern methods of computer text-to-speech (TTS) synthesis offer the possibility of "digital conservation" of a patient's original voice for his/her future speech communication, a procedure called voice banking or voice conservation. Moreover, the banking procedure can be undertaken by anyone facing voice degradation or loss in the more distant future, or who simply wishes to keep his/her voice-print.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
869. HindEnCorp 0.5
- Creator:
- Bojar, Ondřej, Diatka, Vojtěch, Straňák, Pavel, Tamchyna, Aleš, and Zeman, Daniel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parallel corpus, English-Hindi parallel corpus, and sentence-parallel
- Language:
- Hindi and English
- Description:
- HindEnCorp parallel texts (sentence-aligned) come from the following sources:
Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).
Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi.
EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.
Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus.
For the current release, we are extending the parallel corpus using these sources:
Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp's core texts amount to 202 million words. These core texts are the most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, an English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.
TED talks held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.
The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting and punctuation through capitalization, spelling and word choice to sentence structure. Some control could in principle be obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.
Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
870. Hindi Visual Genome 1.0
- Creator:
- Parida, Shantipriya and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- parallel corpus, corpus, multilingual, machine translation, shared task, English-Hindi parallel corpus, image captioning, and multi-modal
- Language:
- English and Hindi
- Description:
- Data
----
Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images, suitable for the English-to-Hindi multimodal machine translation task and for multimodal research. We selected short English segments (captions) from Visual Genome along with the associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account. The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome. Additionally, a challenge test set of 1400 segments will be released for the WAT2019 multi-modal task. This challenge test set was created by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity.

Dataset Formats
---------------
The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files. All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hindi Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.

Data Statistics
---------------
The statistics of the current release are given below.

Parallel Corpus Statistics
--------------------------
Dataset          Segments   English Words   Hindi Words
--------------   --------   -------------   -----------
Train               28932          143178        136722
Dev                   998            4922          4695
Test                 1595            7852          7535
Challenge Test       1400            8185          8665   (Released separately)
--------------   --------   -------------   -----------
Total               32925          164137        157617
The word counts are approximate, prior to tokenization.

Citation
--------
If you use this corpus, please cite the following paper:
@article{hindi-visual-genome:2019,
  title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
  author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
  journal={Computaci{\'o}n y Sistemas},
  note={In print. Presented at CICLing 2019, La Rochelle, France},
  year={2019},
}
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
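
Reading the Hindi Visual Genome 1.0 text files (entry 870): the seven-column, tab-delimited layout described in that record could be parsed roughly as sketched below. This is a minimal sketch, not code shipped with the dataset. The names `Segment`, `read_segments` and `crop_box` are hypothetical, the sample line is invented for illustration, and the sketch assumes that the files carry no header row, that X, Y, Width and Height are integers, and that (X, Y) is the top-left corner of the region.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO, Tuple


@dataclass(frozen=True)
class Segment:
    """One caption row: image id, region rectangle, and both texts."""
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    hindi: str


def read_segments(handle: TextIO) -> Iterator[Segment]:
    """Yield one Segment per non-empty tab-delimited line.

    Assumes no header row and integer region coordinates.
    """
    # QUOTE_NONE keeps any quote characters in the captions as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 7:
            raise ValueError(f"line {line_no}: expected 7 columns, got {len(row)}")
        image_id, x, y, width, height, english, hindi = row
        yield Segment(image_id, int(x), int(y), int(width), int(height),
                      english, hindi)


def crop_box(seg: Segment) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom), assuming (X, Y) is the top-left corner."""
    return (seg.x, seg.y, seg.x + seg.width, seg.y + seg.height)


# Invented sample line, for illustration only (not taken from the dataset).
sample = "12\t10\t20\t30\t40\ta man\tएक आदमी\n"
segments = list(read_segments(io.StringIO(sample)))
```

With a real file, the same function could be called on a handle opened with `encoding="utf-8"` and `newline=""`; the box returned by `crop_box` is in the (left, top, right, bottom) order that many image libraries expect for cropping.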