AlbMoRe is a sentiment analysis corpus of movie reviews in Albanian, consisting of 800 records in CSV format. Each record includes a text review retrieved from IMDb and translated into Albanian by the author, together with a label of 0 (negative) or 1 (positive), also added by the author. The corpus is fully balanced, consisting of 400 positive and 400 negative reviews of 67 movies of different genres. The AlbMoRe corpus is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper: Çano Erion. AlbMoRe: A Corpus of Movie Reviews for Sentiment Analysis in Albanian. CoRR, abs/2306.08526, 2023. URL https://arxiv.org/abs/2306.08526.
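A minimal sketch of reading such a CSV file and checking the class balance is given below. The column names `text` and `label` and the presence of a header row are assumptions made for illustration only; the actual file layout should be checked before use. The two sample rows are placeholders, not records from the corpus.

```python
import csv
import io
from collections import Counter


def read_reviews(csv_file):
    """Yield (text, label) pairs from an open CSV file object.

    Assumes a header row with columns 'text' and 'label'
    (hypothetical names, not confirmed by the corpus description).
    """
    for row in csv.DictReader(csv_file):
        yield row["text"], int(row["label"])


def label_counts(pairs):
    """Count how many reviews carry each label (0 = negative, 1 = positive)."""
    return Counter(label for _, label in pairs)


# Tiny in-memory stand-in for the real file (placeholder rows).
sample = io.StringIO(
    "text,label\n"
    '"placeholder positive review",1\n'
    '"placeholder negative review",0\n'
)
counts = label_counts(read_reviews(sample))
print(counts[1], counts[0])
```

For the real file, one would open it with `open(path, encoding="utf-8", newline="")` and expect both counts to be 400 if the layout matches the assumptions above.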
AlbNER is a Named Entity Recognition corpus of Wikipedia sentences in Albanian, consisting of 900 records. The sentence tokens are manually labeled following the CoNLL-2003 shared task annotation scheme (explained at https://aclanthology.org/W03-0419.pdf), which uses the I-ORG, B-ORG, I-PER, B-PER, I-LOC, B-LOC, I-MISC, B-MISC and O tags. AlbNER data are released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using the AlbNER corpus, please cite the following paper: Çano Erion. AlbNER: A Corpus for Named Entity Recognition in Albanian. CoRR, abs/2309.08741, 2023. URL https://arxiv.org/abs/2309.08741.
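As an illustration of how such tags can be turned into entity spans, the sketch below groups consecutive tokens of the same type. It assumes that tokens and tags are already available as two parallel lists (the on-disk format is not described here), and it treats an I- tag as opening an entity when no entity of that type is in progress, as well as a B- tag as always opening a new one, so that it tolerates both conventions. The example sentence is made up and is not taken from the corpus.

```python
def extract_entities(tokens, tags):
    """Group tokens into (entity_type, text) spans from I-/B-/O tags."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, etype = "O", None
        else:
            prefix, etype = tag.split("-", 1)
        # A new entity starts on any B- tag, or on an I- tag whose type
        # differs from the entity currently being collected.
        starts_new = etype is not None and (prefix == "B" or etype != current_type)
        if current_type is not None and (etype is None or starts_new):
            spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
        if etype is not None:
            if current_type is None:
                current_type = etype
            current_tokens.append(token)
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Made-up example sentence (not from the corpus).
tokens = ["Tirana", "është", "kryeqyteti", "i", "Shqipërisë", "."]
tags = ["I-LOC", "O", "O", "O", "I-LOC", "O"]
entities = extract_entities(tokens, tags)
print(entities)
```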
AlbNews is a topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample includes a headline text retrieved from Albanian online news portals and one of four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. Each of the unlabeled samples contains a headline text only. The AlbNews corpus is released under the CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper:
Çano Erion, Lamaj Dario. AlbNews: A Corpus of Headlines for Topic Modeling in Albanian. CoRR, abs/2402.04028, 2024. URL: https://arxiv.org/abs/2402.04028.
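The four label codes can be mapped to readable topic names as sketched below. The in-memory representation, a list of `(headline, code)` pairs with `None` standing for an unlabeled sample, is an assumption for illustration, since the file layout is not described here; the headlines shown are placeholders.

```python
# Label codes as listed in the corpus description.
TOPIC_NAMES = {
    "pol": "politics",
    "cul": "culture",
    "eco": "economy",
    "spo": "sport",
}


def split_samples(samples):
    """Separate labeled and unlabeled headlines.

    `samples` is an iterable of (headline, code) pairs, where code is one
    of the four label codes or None (hypothetical representation).
    """
    labeled, unlabeled = [], []
    for headline, code in samples:
        if code is None:
            unlabeled.append(headline)
        elif code in TOPIC_NAMES:
            labeled.append((headline, TOPIC_NAMES[code]))
        else:
            raise ValueError(f"unknown label code: {code!r}")
    return labeled, unlabeled


# Placeholder headlines, not taken from the corpus.
labeled, unlabeled = split_samples(
    [("headline A", "pol"), ("headline B", None), ("headline C", "spo")]
)
print(labeled, unlabeled)
```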
Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court).
280 decisions were annotated by one trained annotator and then manually adjudicated by one trained curator. 70 decisions were annotated by two trained annotators and then manually adjudicated by one trained curator. Adjudication was conducted destructively; therefore, the dataset contains only the correct annotations and does not contain all original annotations.
The corpus was developed as training and testing material for text segmentation tasks. The dataset contains decisions segmented into Header, Procedural History, Submission/Rejoinder, Court Argumentation, Footer, Footnotes, and Dissenting Opinion. Segmentation allows different parts of the text to be treated differently even if they contain similar linguistic or other features.
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It has been extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing.
The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation.
The hope is that with this dataset, we should be able to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting “skeleton” in the sentence embedding space.
Costra 1.1 is a new dataset for testing geometric properties of sentence embedding spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense or generalization. The dataset is a direct expansion of Costra 1.0, which was extended with more sentences and sentence comparisons.
This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague, as reported in the institute's system Biblio. For each publication, the authors are obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. No filtering was performed, except for removing entries missing the Czech or English abstract, and replacing newline and tab characters with spaces.
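The two preprocessing steps mentioned above (dropping entries that lack either abstract, and replacing newline and tab characters by spaces) could look roughly like the following sketch. The record layout, a dict with keys `cs` and `en`, is a hypothetical choice for illustration and is not the actual Biblio export format; the sample abstracts are made up.

```python
def clean_text(text):
    """Replace newline and tab characters by single spaces."""
    return text.replace("\n", " ").replace("\t", " ")


def filter_records(records):
    """Keep only records that have both a Czech and an English abstract.

    Each record is assumed to be a dict with keys 'cs' and 'en'
    (hypothetical field names).
    """
    kept = []
    for rec in records:
        cs, en = rec.get("cs"), rec.get("en")
        if not cs or not en:
            continue  # entry is missing one of the two abstracts
        kept.append({"cs": clean_text(cs), "en": clean_text(en)})
    return kept


# Made-up records for illustration.
raw = [
    {"cs": "První řádek\ndruhý řádek", "en": "First line\tsecond part"},
    {"cs": "", "en": "Only English"},
    {"cs": "Pouze česky", "en": None},
]
cleaned = filter_records(raw)
print(cleaned)
```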
This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague. For each publication record, the authors are obliged to provide both the original abstract (in Czech or English) and its translation (English or Czech) in the internal Biblio system. The data was filtered for duplicates and missing entries, ensuring that every record is bilingual. Additionally, records of published papers that are indexed by SemanticScholar contain the respective link. The dataset was created from a September 2022 image of the Biblio database and is stored in JSONL format, with each line corresponding to one record.
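Since each line of a JSONL file is one JSON object, the records can be read line by line, as in the sketch below. The field names used in the sample (`id`, `en`, `cs`, `s2_url`) are hypothetical and serve only to make the example runnable; the actual keys should be taken from the released file.

```python
import io
import json


def read_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSONL source."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for the real file; field names are hypothetical.
sample = io.StringIO(
    '{"id": 1, "en": "Abstract A.", "cs": "Abstrakt A.", '
    '"s2_url": "https://example.org/paper-1"}\n'
    '{"id": 2, "en": "Abstract B.", "cs": "Abstrakt B."}\n'
)
records = list(read_jsonl(sample))

# Records that carry a link (here under the assumed key 's2_url').
with_link = [r for r in records if "s2_url" in r]
print(len(records), len(with_link))
```

For the real file, the same function can be applied to a handle opened with `open(path, encoding="utf-8")`.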