Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) / Rights: PUB

101. EdUKate translation software 1

Creator:: Popel, Martin, Novák, Michal, Balhar, Jiří, Košarko, Ondřej, Mayer, Jiří, Poláková, Lucie, Kloudová, Věra, and Anisimova, Mariia
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation and Ukrainian
Language:: Ukrainian and Czech
Description:: This software package includes three tools: a web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, an API server, and a tool for translating documents with markup (html, docx, odt, pptx, odp, ...). These tools are used in the Charles Translator service (https://translator.cuni.cz). The software was developed within the EdUKate project, which aims to help reduce language barriers between non-Czech-speaking children in the Czech Republic and education in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.
Rights:: BSD 2-Clause "Simplified" or "FreeBSD" license, http://opensource.org/licenses/BSD-2-Clause, and PUB

102. EFCL Channelizer

Creator:: Klusáček, David
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: Fast Channelizer, Filterbank, ASR Front End, Software Defined Radio, Polyphase Filter, Frequency Multiplexing, Audio Denoising, High Performance Computing, HPC, SDR, FFT, FFTW, SIMD, AVX, SSE, and NEON
Description:: Extremely fast digital audio channelizer implementation, usable as a building block for experimental ASR front-ends or signal denoising applications. Thanks to its high throughput, it is also applicable in software defined radios. It comes in the form of a C/C++ library and an executable example program that reads an input stream, splits it into equidistant frequency channels, and emits their data to the output. Features: (1) Hand-tuned SIMD-aware assembly for x86 (SSE) and IA64 (AVX) as well as for ARM (NEON) processors. (2) Generic non-SIMD C++ implementation for other architectures. (3) Capable of taking advantage of multicore CPUs. (4) Fully configurable number of channels and output decimation rate. (5) User-supplied FIR of the channel separation filter, which makes it possible to specify the width of the channels and whether they should overlap or be separated. (6) Input and output signal samples are treated as complex numbers. (7) Speed of over 750 complex MS/s achieved on a Core i7 4710HQ @ 2.5GHz when channelizing into 72 output channels with a FIR length of 1152 samples, using 3 computing threads. (8) Runs under Linux.
Rights:: Mozilla Public License 2.0, http://opensource.org/licenses/MPL-2.0, and PUB

103. ELITR Minuting Corpus

Creator:: Nedoluzhko, Anna, Singh, Muskaan, Hledíková, Marie, Ghosal, Tirthankar, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: summarization, minuting, and meeting minutes
Language:: English and Czech
Description:: ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two. Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain. Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data. This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71097 and 87322 dialogue turns respectively. For Czech meetings, we provide 147 total minutes with 55 of them aligned. For English meetings, it is 256 total minutes with 111 of them aligned. Please find a more detailed description of the data in the included README and stats.tsv files. If you use this corpus, please cite: Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O. (2022). ELITR Minuting Corpus: A novel dataset for automatic minuting from multi-party meetings in English and Czech. In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC-2022), Marseille, France, June. European Language Resources Association (ELRA). In print. @inproceedings{elitr-minuting-corpus:2022, author = {Anna Nedoluzhko and Muskaan Singh and Marie Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar}, title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech}, booktitle = {Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC-2022)}, year = 2022, month = {June}, address = {Marseille, France}, publisher = {European Language Resources Association (ELRA)}, note = {In print.} }
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

104. EMMT (Eyetracked Multi-Modal Translation)

Creator:: Bhattacharya, Sunit, Kloudová, Věra, Zouhar, Vilém, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: sight translation and multi-modal
Language:: English and Czech
Description:: Eyetracked Multi-Modal Translation (EMMT) is a simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios. It contains monocular eye movement recordings, audio data and 4-electrode wearable electroencephalogram (EEG) data of 43 participants while engaged in sight translation supported by an image. The details about the experiment and the dataset can be found in the README file.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

105. English Model (CoNLL-2003) for NameTag

Creator:: Straka, Milan and Straková, Jana
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, mlmodel, and languageDescription
Subject:: NameTag, English, and named entity recognition
Language:: English
Description:: English model for NameTag, a named entity recognition tool. The model is trained on CoNLL-2003 training data. Recognizes PER, ORG, LOC and MISC named entities. Achieves F-measure 84.73 on CoNLL-2003 test data.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

106. English Models (Morphium + WSJ) for MorphoDiTa

Creator:: Straka, Milan and Straková, Jana
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, languageDescription, and mlmodel
Subject:: MorphoDiTa, English, morphological analysis, morphological generation, and PoS tagging
Language:: English
Description:: English models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from Morphium and SCOWL (Spell Checker Oriented Word Lists); the PoS tagger is trained on WSJ (Wall Street Journal). This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013). The morphological POS analyzer development was supported by a grant of the Ministry of Education, Youth and Sports of the Czech Republic, No. LC536 "Center for Computational Linguistics". The morphological POS analyzer research was performed by Johanka Spoustová (Spoustová 2008; the Treex::Tool::EnglishMorpho::Analysis Perl module). The lemmatizer was implemented by Martin Popel (Popel 2009; the Treex::Tool::EnglishMorpho::Lemmatizer Perl module). The lemmatizer is based on morpha, which was released under the LGPL licence as part of the RASP system (http://ilexir.co.uk/applications/rasp). The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of the Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of the Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

107. English-Hindi Parallel Corpus

Creator:: Bojar, Ondřej, Straňák, Pavel, Zeman, Daniel, Jain, Gaurav, and Damani, Om Prakesh
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: English-Hindi parallel corpus and parallel corpus
Language:: Hindi and English
Description:: English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the Emille parallel corpus. Acknowledgement: FP7-ICT-2007-3-231720 (EuroMatrix Plus) and 7E09003 (Czech part of EM+).
Rights:: Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB

108. English-Slovak Parallel Corpus

Creator:: Galuščáková, Petra, Garabík, Radovan, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus and English-Slovak corpus
Language:: Slovak and English
Description:: English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4]: EMEA, EUConst, KDE4 and PHP) and a downloaded copy of the European Commission website [5]. The corpus is published both in plaintext format and with automatic morphological annotation. References: [1] http://langtech.jrc.it/JRC-Acquis.html/ [2] http://www.statmt.org/europarl/ [3] http://apertium.eu/data [4] http://opus.lingfil.uu.se/ [5] http://ec.europa.eu/ This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

109. English-Urdu Religious Parallel Corpus

Creator:: Jawaid, Bushra and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, religious text, and machine translation
Language:: English and Urdu
Description:: The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include but are not limited to the following: (1) Manually corrected sentence alignment of the corpora. (2) Our data split (training-development-test) so that our published experiments can be reproduced. (3) Tokenization (optional, but needed to reproduce our experiments). (4) Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, and removal of Urdu diacritics.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

110. EngVallex - English Valency Lexicon

Creator:: Cinková, Silvie, Fučíková, Eva, Šindlerová, Jana, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, computationalLexicon, and lexicalConceptualResource
Subject:: Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, and Valency
Language:: English
Description:: EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of the surface form of verbal arguments. EngVallex also contains links to PropBank and VerbNet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

