Search Results
32. El Emperador Carlos V y su corte
- Type:
- text and correspondence
- Subject:
- History of states and territories on the Iberian Peninsula, History of Central European countries, Ferdinand, Salinas, Martín de, Karel, editions, manuscripts, correspondence, rulers of the Habsburg monarchy, German rulers, envoys, Habsburg monarchy, Spain, foreign policy, international relations, and world history 1492-1648
- Language:
- Spanish
- Rights:
- unknown
33. El Misionero Checo Miguel Sabel y el comercio del cristal de Bohemia en América
- Creator:
- Štěpánek, Pavel
- Type:
- text and study
- Subject:
- History of the Christian Church, Sabel, Miguel, religious orders, Jesuits, missionaries, glass, iconography, world history 1789-1918, Venezuela, individuals (church history), trade, and Czech lands 1620-1740
- Language:
- Spanish
- Rights:
- unknown
34. Engineering job ads corpus
- Creator:
- Cardenas Acosta, Ronald, Bello Medina, Kevin, Coronado, Alberto, and Villota, Elizabeth
- Publisher:
- National University of Engineering, Peru
- Type:
- text and corpus
- Subject:
- job-advertisement, PoS tagging, and text corpora
- Language:
- Spanish
- Description:
- The corpus consists of job ads in Spanish for engineering positions in Peru. The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks. The corpus is divided into two components: (1) POS tagging/NER training data, consisting of 800 job ads, each tokenized and manually annotated with POS tag information (EAGLE format) and entity labels in BIO format; (2) topic modeling training data, containing 9000 documents stripped of stopwords. The topic modeling data comes in two formats: whole-text documents, containing all the information originally posted in the ad, and extracted-chunk documents, containing chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in "Improving Topic Coherence Using Entity Extraction Denoising" (to appear).
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
35. esCorpius: A Massive Spanish Crawling Corpus
- Creator:
- Gutiérrez-Fandiño, Asier, Pérez-Fernández, David, Armengol-Estapé, Jordi, Griol, David, and Callejas, Zoraida
- Publisher:
- LHF Labs
- Type:
- text and corpus
- Subject:
- spanish crawling corpus, crawling corpus, spanish corpus, massive corpus, large corpus, clean, and deduplicated
- Language:
- Spanish
- Description:
- In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish have important shortcomings: they are either too small in comparison with other languages or of low quality because of sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
36. Europarl QTLeap WSD/NED corpus
- Creator:
- Agirre, Eneko, Branco, António, Popel, Martin, and Simov, Kiril
- Publisher:
- University of the Basque Country, UPV/EHU, Faculty of Science, University of Lisbon, FCUL, Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL), and Bulgarian Academy of Sciences, IICT-BAS
- Type:
- text and corpus
- Subject:
- annotated corpus and multilingual
- Language:
- Basque, Bulgarian, Czech, English, Portuguese, and Spanish
- Description:
- This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are sentences from the Europarl parallel corpus (Koehn, 2005). We selected the monolingual sentences from the parallel corpora for the following pairs: Bulgarian-English, Czech-English, Portuguese-English and Spanish-English. The English corpus consists of the English side of the Spanish-English corpus. Basque is not in Europarl; in addition, the corpus contains the Basque and English sides of the GNOME corpus. The texts have been automatically annotated with NLP tools, including Word Sense Disambiguation, Named Entity Disambiguation and Coreference Resolution. Please check deliverable D5.6 at http://qtleap.eu/deliverables for more information.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
37. Extended CLEF eHealth 2013-2015 IR Test Collection
- Creator:
- Pecina, Pavel and Saleh, Shadi
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- cross-lingual information retrieval and machine translation
- Language:
- English, Czech, French, German, Hungarian, Polish, Spanish, and Swedish
- Description:
- This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013-2015. Compared to the original version, it provides complete query translations into Czech, French, German, Hungarian, Polish, Spanish and Swedish, as well as additional relevance assessments.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
38. HamleDT 2.0
- Creator:
- Zeman, Daniel, Mareček, David, Mašek, Jan, Popel, Martin, Ramasamy, Loganathan, Rosa, Rudolf, Štěpánek, Jan, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- treebank, Stanford dependencies, Prague dependencies, harmonization, common annotation style, and Interset
- Language:
- Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Modern Greek (1453-), English, Spanish, Estonian, Basque, Persian, Finnish, Ancient Greek (to 1453), Hindi, Hungarian, Italian, Japanese, Latin, Dutch, Portuguese, Romanian, Russian, Slovak, Slovenian, Swedish, Tamil, Telugu, and Turkish
- Description:
- HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular recently. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes.
- Rights:
- HamleDT 2.0 Licence Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-2.0, and ACA
39. HamleDT 3.0
- Creator:
- Zeman, Daniel, Mareček, David, Mašek, Jan, Popel, Martin, Ramasamy, Loganathan, Rosa, Rudolf, Štěpánek, Jan, and Žabokrtský, Zdeněk
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- annotated corpus, morphology, syntax, dependency, treebank, harmonized annotation, and common annotation style
- Language:
- Arabic, Basque, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Modern Greek (1453-), Ancient Greek (to 1453), Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, and Turkish
- Description:
- HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style. Update (November 2017): for a current collection of harmonized dependency treebanks, we recommend using the Universal Dependencies (UD). All of the corpora that are distributed in HamleDT in full are also part of the UD project; only some corpora from the Patch group (where HamleDT provides only the harmonizing scripts but not the full corpus data) are available in HamleDT but not in UD.
- Rights:
- HamleDT 3.0 License Terms, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-3.0, and PUB
40. Historia de España y de la civilización española.
- Creator:
- Altamira y Crevea, Rafael
- Type:
- text and monograph
- Subject:
- History of states and territories on the Iberian Peninsula, history of states, Spain, general surveys of world history (chronological), and general surveys (thematic)
- Language:
- Spanish
- Rights:
- unknown