Rights: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) / Type: text

1. AKCES-GEC Grammatical Error Correction Dataset for Czech

Creator:: Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Straka, Milan, Rosen, Alexandr, Náplava, Jakub, and Poláčková, Marie
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: natural language correction, grammatical error correction, and gec
Language:: Czech
Description:: AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that in comparison to CZESL-GEC dataset, this dataset contains separated edits together with their type annotations in M2 format and also has two times more sentences. If you use this dataset, please use following citation: @article{naplava2019wnut, title={Grammatical Error Correction in Low-Resource Scenarios}, author={N{\'a}plava, Jakub and Straka, Milan}, journal={arXiv preprint arXiv:1910.00353}, year={2019} }
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

2. Arabic Triliteral roots Lexicon

Creator:: Mustafa, Ebtihal and Bouzoubaa, Karim
Publisher:: MDPI language jurnal
Type:: text, lexicon, and lexicalConceptualResource
Subject:: Arabic language, Arabic roots, lexicons, phonetic system, bigram frequencies, and roots weight.
Language:: Arabic
Description:: Description: This xml file is a lexicon containing all 21952 (28x28x28) Arabic triliteral combinations (roots). the file is split into three parts as follow: the first part contains the phonetic constraints that must be taken into account in the formation of Arabic roots (for more details see all_phonetic_rules.xml in http://arabic.emi.ac.ma/alelm/?q=Resources). the second part contains the lexicons that were used to create this lexicon (see in lexicons tag). the third part contains the roots. ISLRN: 813-907-570-946-2
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

3. Arabic WordNet ontology

Creator:: Abouenour, Lahcen, Bouzoubaa, Karim, and Rosso, Paulo
Publisher:: Wordnet
Type:: text, wordnet, and lexicalConceptualResource
Subject:: WordNet
Language:: Arabic
Description:: This improved version is an extension of the original Arabic Wordnet (http://globalwordnet.org/arabic-wordnet/awn-browser/), it was enriched by new verbs, nouns including the broken plurals that is a specific form for Arabic words.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

4. Broken plural list

Creator:: Ouamer, meriem, Bouzoubaa, Karim, and Tajmout, rachida
Publisher:: ALELM research group
Type:: text, wordList, and lexicalConceptualResource
Subject:: Broken plural
Language:: Arabic
Description:: An LMF conformant XML-based file containing a comprehensive Arabic broken plural list. The file contains 12,249 singular words with their corresponding BPs
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

5. C4Corpus (CC BY-NC-SA part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

6. CALEM (Comprehensive Arabic LEMmas)

Creator:: Namly, Driss, Bouzoubaa, Karim, and El Jihad, Abdelhamid
Publisher:: ALELM
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon, lemmatization, and stemming;
Language:: Arabic
Description:: Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems) with details (POS + Root). Each lexical entry represents a lemma followed by all its possible stems and each stem is enriched by its morphological features especially the root and the POS. It is composed of 164,845 lemmas representing 7,200,918 stems, detailed as follow: 757 Arabic particles 2,464,631 verbal stems 4,735,587 nominal stems The lexicon is provided as an LMF conformant XML-based file in UTF8 encoding, which represents about 1,22 Gb of data. Citation: – Namly Driss, Karim Bouzoubaa, Abdelhamid El Jihad, and Si Lhoussain Aouragh. “Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique.” In Recent Advances in NLP: The Case of Arabic Language, pp. 81-100. Springer, Cham, 2020.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

7. CEHugeWebCorpus

Creator:: Rüdiger, Jan Oliver
Publisher:: Rüdiger, Jan Oliver
Type:: text and corpus
Subject:: corpus, German, Germanistik, Web corpus, web corpora, and CorpusExplorer
Language:: German
Description:: This corpus was originally created for performance testing (server infrastructure CorpusExplorer - see: diskurslinguistik.net / diskursmonitor.de). It includes the filtered database (German texts only) of CommonCrawl (as of March 2018). First, the URLs were filtered according to their top-level domain (de, at, ch). Then the texts were classified using NTextCat and only uniquely German texts were included in the corpus. The texts were then annotated using TreeTagger (token, lemma, part-of-speech). 2.58 million documents - 232.87 million sentences - 3.021 billion tokens. You can use CorpusExplorer (http://hdl.handle.net/11234/1-2634) to convert this data into various other corpus formats (XML, JSON, Weblicht, TXM and many more).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/

8. CERED baseline models

Creator:: Šimečková, Zuzana and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: mlmodel, text, and languageDescription
Subject:: relationship extraction
Language:: Czech
Description:: Relationship extraction models for the Czech language. Models are trained on CERED (dataset created by distant supervision on Czech Wikipedia and Wikidata) and recognize a subset of Wikidata relations (listed in CEREDx.LABELS). We supply a demo.py that performs inference on user-defined input and requirements.txt file for pip. Adapt the demo code to use the model. Both the dataset and the models are presented in Relationship Extraction thesis.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/

9. CLEF-TREC Q/A

Creator:: Abouenour, lahcen, Bouzoubaa, Karim, and Rosso, paolo
Publisher:: ALELM
Type:: text, other, and lexicalConceptualResource
Subject:: CLEF and TREC
Language:: Arabic
Description:: List of 2264 questions + answers of CLEF and TREC, translated to Arabic
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

10. CoCzeFLA Chroma 2022.07

Creator:: Chromá, Anna and Matiasovitsová, Klára
Publisher:: Faculty of Arts, Charles Univesity
Type:: text and corpus
Subject:: first language acquisition, typical development, and monolingual corpus
Language:: Czech
Description:: Transcripts of longitudinal audio recordings of 7 Czech typical monolingual children between 1;7 to 3;9. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the presudonym of the child and her age at the given session in form YMMDD. Transcription rules and other details are to find on the homepage coczefla.ff.cuni.cz.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

11. CoCzeFLA Chroma 2023.04

Creator:: Chromá, Anna, Matiasovitsová, Klára, Sláma, Jakub, and Treichelová, Jolana
Publisher:: Charles University, Faculty of Arts
Type:: text and corpus
Subject:: first language acquisition, typical development, longitudinal corpus, and Czech
Language:: Czech
Description:: A new version of the previously published corpus Chroma. The version 2023.04 includes six children. Two transcripts (Julie20221, Klara30424) were removed since they did not meet the criteria on the dialogical format. The transcripts were revised (eliminating typing errors and inconsistencies in the transcription format) and morphologically annotated by the automatic tool MorphoDiTa. Detailed manual control of the annotation was performed on children's utterances; the annotation of adult data was not checked yet. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

12. CoCzeFLA Chroma 2023.07

Creator:: Chromá, Anna, Sláma, Jakub, Matiasovitsová, Klára, and Kohoutková, Jolana
Publisher:: Charles University, Faculty of Arts
Type:: text and corpus
Subject:: first language acquisition, typical development, longitudinal corpus, and Czech
Language:: Czech
Description:: A new version of the previously published corpus Chroma wih morphological annotation. The version 2023.07 differs from 2023.04 in that it includes all seven children and it went through an additional careful check of consistency and conformity to the CHAT transcription principles. Two transcripts (Julie20221, Klara30424) from the previous versions (2022.07, 2019.07) were removed since they did not meet our criteria on dialogical format. All transcripts of recordings made during one day were split into one file. Thus, version 2023.07 consists of 183 files/transcripts. The number of utterances and tokens given here in LINDAT corresponds to children's lines only. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

13. CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings

Creator:: Ginter, Filip, Hajič, Jan, Luotolahti, Juhani, Straka, Milan, and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, mlmodel, and languageDescription
Subject:: CoNLL 2017, word embeddings, and automatic annotation
Language:: Multiple languages
Description:: Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with word embeddings of dimension 100 computed from lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/). For each language, automatic annotations in CoNLL-U format are provided in a separate archive. The word embeddings for all languages are distributed in one archive. Note that the CC BY-SA-NC 4.0 license applies to the automatically generated annotations and word embeddings, not to the underlying data, which may have different license and impose additional restrictions. Update 2018-09-03 =============== Added data in the 4 “surprise languages” from the 2017 ST: Buryat, Kurmanji, North Sami and Upper Sorbian. This has been promised before, during CoNLL-ST 2018 we gave the participants a link to this record saying the data was here. It wasn't, sorry. But now it is.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

14. Corpus for training and evaluating diacritics restoration systems

Creator:: Náplava, Jakub, Straka, Milan, Hajič, Jan, and Straňák, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: diacritical marks generation and natural language correction
Language:: Czech, Vietnamese, Romanian, Polish, Slovak, Spanish, Croatian, Irish, Latvian, Hungarian, French, and Turkish
Description:: Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. Moreover, each language dataset contains (substantially larger) training set collected from (general) Web texts. All sets, except for Wikipedia and Web training sets that can contain similar sentences, are disjoint. Data are segmented into sentences which are further word tokenized. All data in the corpus contain diacritics. To strip diacritics from them, use Python script diacritization_stripping.py contained within attached stripping_diacritics.zip. This script has two modes. We generally recommend using method called uninames, which for some languages behaves better. The code for training recurrent neural-network based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

15. CsEnVi Pairwise Parallel Corpora

Creator:: Hoang, Duc Tam and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, Vietnamese, parallel corpus, Czech-Vietnamese corpus, and English-Vietnamese corpus
Language:: Czech, English, and Vietnamese
Description:: CsEnVi Pairwise Parallel Corpora consist of Vietnamese-Czech parallel corpus and Vietnamese-English parallel corpus. The corpora were assembled from the following sources: - OPUS, the open parallel corpus is a growing multilingual corpus of translated open source documents. The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series. The nature of the bitexts are paraphrasing of each other's meaning, rather than translations. - TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015. The size of the original corpora collected from OPUS and TED talks is as follows: CS/VI EN/VI Sentence 1337199/1337199 2035624/2035624 Word 9128897/12073975 16638364/17565580 Unique word 224416/68237 91905/78333 We improve the quality of the corpora in two steps: normalizing and filtering. In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. The sequences of dots are distributed differently in the source and the target side. Removing the sequence of dots, along with a number of other normalization rules, improves the quality of the alignment significantly. In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs. The size of cleaned corpora as published is as follows: CS/VI EN/VI Sentence 1091058/1091058 1113177/1091058 Word 6718184/7646701 8518711/8140876 Unique word 195446/59737 69513/58286 The corpora are used as training data in [2]. References: [1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey. [2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

16. Czech Court Decisions Dataset

Creator:: Kríž, Vincent and Hladká, Barbora
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: named entities, annotation, and corpus
Language:: Czech
Description:: We present the Czech Court Decisions Dataset (CCDD) -- a dataset of 300 manually annotated court decisions published by The Supreme Court of the Czech Republic and the Constitutional Court of the Czech Republic.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

17. Czech HS Contracts Dataset (CHSC) 1.0

Creator:: Szabó, Adam and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: Czech, document classification, contracts, and Hlídač státu
Language:: Czech
Description:: Czech Contracts dataset was created as a part of the thesis Low-resource Text Classification (2021), A. Szabó, MFF UK. Contracts are obtained from the Hlídač Státu web portal. Labels in the development and training set are automatically classified on the basis of the keyword method according to the thesis Automatická klasifikace smluv pro portál HlidacSmluv.cz, J. Maroušek (2020), MFF UK. For this reason, the goal in the classification is not to achieve 100% on the development set, as the classification contains a certain amount of noise. The test set is manually annotated. The dataset contains a total of 97493 contracts.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/

18. Czech Legal Text Treebank

Creator:: Kríž, Vincent, Hladká, Barbora, and Urešová, Zdeňka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: treebank, corpus, Czech, legal texts, and legal domain
Language:: Czech
Description:: The Czech Legal Text Treebank (CLTT) is a collection of 1133 manually annotated dependency trees. CLTT consists of two legal documents: The Accounting Act (563/1991 Coll., as amended) and Decree on Double-entry Accounting for undertakers (500/2002 Coll., as amended).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

19. Czech Legal Text Treebank 2.0

Creator:: Kríž, Vincent and Hladká, Barbora
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: treebank, Prague dependencies, named entities, and semantic relations
Language:: Czech
Description:: The Czech Legal Text Treebank 2.0 (CLTT 2.0) annotates the same texts as the CLTT 1.0. These texts come from the legal domain and they are manually syntactically annotated. The CLTT 2.0 annotation on the syntactic layer is more elaborate than in the CLTT 1.0 from various aspects. In addition, new annotation layers were added to the data: (i) the layer of accounting entities, and (ii) the layer of semantic entity relations.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

20. Czech Lexico-Semantic Database 0.1

Creator:: Tichy, Ondrej, Obstova, Zora, and Klegr, Ales
Publisher:: Charles University, Faculty of Arts
Type:: text, thesaurus, and lexicalConceptualResource
Subject:: onomasiological lexicography, thesaurus, lexico-semantic database, digitization, and Czech
Language:: Czech
Description:: A lexicographical project, whose aim is to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) in order to create an integrated digital multi-purpose lexico-semantic database of Czech.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

1. AKCES-GEC Grammatical Error Correction Dataset for Czech

2. Arabic Triliteral roots Lexicon

3. Arabic WordNet ontology

4. Broken plural list

5. C4Corpus (CC BY-NC-SA part)

6. CALEM (Comprehensive Arabic LEMmas)

7. CEHugeWebCorpus

8. CERED baseline models

9. CLEF-TREC Q/A

10. CoCzeFLA Chroma 2022.07

11. CoCzeFLA Chroma 2023.04

12. CoCzeFLA Chroma 2023.07

13. CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings

14. Corpus for training and evaluating diacritics restoration systems

15. CsEnVi Pairwise Parallel Corpora

16. Czech Court Decisions Dataset

17. Czech HS Contracts Dataset (CHSC) 1.0

18. Czech Legal Text Treebank

19. Czech Legal Text Treebank 2.0

20. Czech Lexico-Semantic Database 0.1

Limit your search

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Creator

Show values starting with

Language

Show values starting with

Publisher

Show values starting with

Rights

Subject

Show values starting with

Type

Show values starting with

Date

Original context has metadata only

Harvested from