Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

611. English-Luganda Parallel Corpus

Publisher:: Center for Dutch Language and Speech, University of Antwerp
Type:: corpus
Language:: English
Description:: Bible. Word-aligned corpus
Rights:: Not specified

612. English-Slovak Parallel Corpus

Creator:: Galuščáková, Petra, Garabík, Radovan, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus and English-Slovak corpus
Language:: Slovak and English
Description:: English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and a downloaded website of the European Commission [5]. The corpus is published both in plaintext format and with automatic morphological annotation. References: [1] http://langtech.jrc.it/JRC-Acquis.html/ [2] http://www.statmt.org/europarl/ [3] http://apertium.eu/data [4] http://opus.lingfil.uu.se/ [5] http://ec.europa.eu/. This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

613. English-Urdu Religious Parallel Corpus

Creator:: Jawaid, Bushra and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, religious text, and machine translation
Language:: English and Urdu
Description:: The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include but are not limited to the following: 1. Manually corrected sentence alignment of the corpora. 2. Our data split (training-development-test), so that our published experiments can be reproduced. 3. Tokenization (optional, but needed to reproduce our experiments). 4. Normalization (optional) of, e.g., European vs. Urdu numerals and European vs. Urdu punctuation, and removal of Urdu diacritics.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

614. EngVallex - English Valency Lexicon

Creator:: Cinková, Silvie, Fučíková, Eva, Šindlerová, Jana, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, computationalLexicon, and lexicalConceptualResource
Subject:: Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, and Valency
Language:: English
Description:: EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex also contains links to PropBank and VerbNet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

615. EngVallex - English Valency Lexicon 2.0

Creator:: Cinková, Silvie, Fučíková, Eva, Šindlerová, Jana, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, computationalLexicon, and lexicalConceptualResource
Subject:: Annotations, corpus, linguistic data, lexicon, lexical semantics, Monolingual, semantics, verbal valency, and valency
Language:: English
Description:: EngVallex 2.0 is a slightly updated version of EngVallex. It is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex also contains links to PropBank (English predicate-argument lexicon). The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank(s), which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT. EngVallex 2.0 is the same dataset as the EngVallex lexicon packaged with the PCEDT 3.0 corpus, but published separately under a more permissive licence, avoiding the need for an LDC licence, which is tied to PCEDT 3.0 as a whole.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

616. Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0)

Creator:: Zikánová, Šárka, Synková, Pavlína, and Mírovský, Jiří
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: discourse annotation and implicit discourse relations
Language:: Czech
Description:: Enriched discourse annotation of a subset of the Prague Discourse Treebank, adding implicit relations, entity based relations, question-answer relations and other discourse structuring phenomena.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

617. Enrique Stanko Vráz (explorer)

Creator:: Masarykův lidový ústav and Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: zahrada botanická, Galerie osobností, Places::Praha::Nové Město::Na Slupi::botanická zahrada PF UK, People::Vráz Enrique Stanko (1860-1932), People::Kořenský Josef (1847-1938), People::Domin Karel (1882-1953), and Několik čelných cestovatelů českých
Language:: No linguistic content
Description:: Explorer Enrique Stanko Vráz with his colleague Josef Kořenský and botanist Karel Domin in the Botanical Garden in Prague-Na Slupi in the documentary Několik čelných cestovatelů českých (Leading Czech Explorers, Masaryk's People's Institute, 1928).
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

618. EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

Creator:: Ramasamy, Loganathan, Bojar, Ondřej, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus
Language:: English and Tamil
Description:: EnTam is a sentence-aligned English-Tamil bilingual corpus that we have collected from publicly available websites for NLP research involving Tamil. A standard set of processing steps was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema and news domains.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

619. enTenTen

Creator:: (:unav) Unknown author
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: English large corpus
Language:: English
Description:: Very large English web corpus enTenTen, comprising 3,268,798,627 tokens. Lexical Computing Ltd.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

620. esCorpius: A Massive Spanish Crawling Corpus

Creator:: Gutiérrez-Fandiño, Asier, Pérez-Fernández, David, Armengol-Estapé, Jordi, Griol, David, and Callejas, Zoraida
Publisher:: LHF Labs
Type:: text and corpus
Subject:: spanish crawling corpus, crawling corpus, spanish corpus, massive corpus, large corpus, clean, and deduplicated
Language:: Spanish
Description:: In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages, or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB