Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

621. EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

Creator:: Ramasamy, Loganathan, Bojar, Ondřej, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus
Language:: English and Tamil
Description:: EnTam is a sentence-aligned English-Tamil bilingual corpus collected from publicly available websites for NLP research involving Tamil. Standard processing steps were applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema, and news domains.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

622. enTenTen

Creator:: (:unav) Unknown author
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: English large corpus
Language:: English
Description:: Very large English web corpus enTenTen, comprising 3,268,798,627 tokens. Lexical Computing Ltd.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

623. esCorpius: A Massive Spanish Crawling Corpus

Creator:: Gutiérrez-Fandiño, Asier, Pérez-Fernández, David, Armengol-Estapé, Jordi, Griol, David, and Callejas, Zoraida
Publisher:: LHF Labs
Type:: text and corpus
Subject:: Spanish crawling corpus, crawling corpus, Spanish corpus, massive corpus, large corpus, clean, and deduplicated
Language:: Spanish
Description:: In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results for Spanish have important shortcomings: they are either too small in comparison with other languages or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification, and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

624. ESIC 1.0 -- Europarl Simultaneous Interpreting Corpus

Creator:: Macháček, Dominik, Žilinec, Matúš, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: simultaneous interpreting, interpreting, ASR evaluation, automatic machine translation evaluation, and Europarl
Language:: English, Czech, and German
Description:: ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations. The corpus contains the source English video and audio. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the website of the European Parliament, where they are publicly available. The transcripts are punctuated, carry word-level timestamps, and are annotated with metadata (disfluencies, mixing of voices and languages, read or spontaneous speech, etc.). The speeches in the corpus come from the European Parliament plenary sessions, from the period 2008-11. Most of the speakers are MEPs, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous). The current version of ESIC is v1.0. It has validation and evaluation parts.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

625. ESIC 1.1 -- Europarl Simultaneous Interpreting Corpus (2024-02-05)

Creator:: Macháček, Dominik, Žilinec, Matúš, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: simultaneous interpreting, interpreting, ASR evaluation, automatic machine translation evaluation, and Europarl
Language:: English, Czech, and German
Description:: ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations. The corpus contains the source English video and audio. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the website of the European Parliament, where they are publicly available. The transcripts are punctuated, carry word-level timestamps, and are annotated with metadata (disfluencies, mixing of voices and languages, read or spontaneous speech, etc.). The speeches in the corpus come from the European Parliament plenary sessions, from the period 2008-11. Most of the speakers are MEPs, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous). ESIC has validation and evaluation parts. The current version is ESIC v1.1; it extends v1.0 with manual sentence alignment of the tri-parallel texts and with bi-parallel sentence alignment of the English original transcripts and the German interpreting.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

626. Estació Terminus

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Tool for terminology management.
Rights:: Not specified

627. ESTEN

Publisher:: Centre de Terminologia TERMCAT and Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan
Description:: Terminology management.
Rights:: Not specified

628. Estonian Dialect Corpus

Publisher:: University of Tartu
Format:: application/octet-stream
Type:: corpus
Language:: Estonian
Description:: Recordings of different Estonian dialects, 900,000 words, transcribed and partly (400,000 words) morphologically annotated.
Rights:: Not specified

629. Estonian Frequency Dictionary

Publisher:: University of Tartu
Format:: text/plain
Type:: lexicalConceptualResource
Language:: Estonian
Description:: The 10,000 most frequent lemmas and the 1,000 most frequent word forms, based on 1 million words of journals and fiction.
Rights:: Not specified

630. Estonian Reference Corpus

Publisher:: University of Tartu
Format:: application/tei+xml
Type:: corpus
Language:: Estonian
Description:: Collection of Estonian texts (divided into subcorpora); ca. 175 million words; TEI format.
Rights:: Not specified
