AlbMoRe is a sentiment analysis corpus of movie reviews in Albanian, consisting of 800 records in CSV format. Each record includes a text review retrieved from IMDb and translated in Albanian by the author. It also contains a 0 negative) or 1 (positive) label added by the author. The corpus is fully balanced, consisting of 400 positive and 400 negative reviews about 67 movies of different genres. AlbMoRe corpus is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper: Çano Erion. AlbMoRe: A Corpus of Movie Reviews for Sentiment Analysis in Albanian. CoRR, abs/2306.08526, 2023. URL https://arxiv.org/abs/2306.08526.
AlbNER is a Named Entity Recognition corpus of Wikipedia sentences in Albanian, consisting of 900 records. The sentence tokens are manually labeled complying with the CoNLL-2003 shared task annotation scheme explained at https://aclanthology.org/W03-0419.pdf that uses I-ORG, B-ORG, I-PER, B-PER, I-LOC, B-LOC, I-MISC, B-MISC and O tags. AlbNER data are released under CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using AlbMoRe corpus, please cite the following paper: Çano Erion. AlbNER: A Corpus for Named Entity Recognition in Albanian. CoRR, abs/2309.08741, 2023. URL https://arxiv.org/abs/2309.08741.
AlbNews is a topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample includes a headline text retrieved from Albanian online news portals. It also contains one of the four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. Each of the unlabeled samples contain a headline text only.AlbTopic corpus is released under CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper:
Çano Erion, Lamaj Dario. AlbNews: A Corpus of Headlines for Topic Modeling in Albanian. CoRR, abs/2402.04028, 2024. URL: https://arxiv.org/abs/2402.04028.
Annotated corpus of 350 decision of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court).
Every decision is annotated by two trained annotators and then manually adjudicated by one trained curator to solve possible disagreements between annotators. Adjudication was conducted non-destructively, therefore dataset contains all original annotations.
Corpus was developed as training and testing material for reference recognition tasks. Dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document etc.), values (polarity, depth of discussion etc.).
Annotated corpus of 350 decision of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court).
Every decision is annotated by two trained annotators and then manually adjudicated by one trained curator to solve possible disagreements between annotators. Adjudication was conducted non-destructively, therefore corpus (raw) contains all original annotations.
Corpus was developed as training and testing material for reference recognition tasks. Dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document etc.), values (polarity, depth of discussion etc.).
Annotated corpus of 350 decision of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court).
280 decisions were annotated by one trained annotator and then manually adjudicated by one trained curator. 70 decisions were annotated by two trained annotators and then manually adjudicated by one trained curator. Adjudication was conducted destructively, therefore dataset contains only the correct annotations and does not contain all original annotations.
Corpus was developed as training and testing material for text segmentation tasks. Dataset contains decision segmented into Header, Procedural History, Submission/Rejoinder, Court Argumentation, Footer, Footnotes, and Dissenting Opinion. Segmentation allows to treat different parts of text differently even if it contains similar linguistic or other features.
We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated further 33 play scripts. In this version of the data, we release only play scripts that can be freely distributed, which is 9 play scripts. One play is annotated independently by three annotators.
Description : This is an online edition of An Anglo-Saxon Dictionary, or a dictionary of "Old English". The dictionary records the state of the English language as it was used between ca. 700-1100 AD by the Anglo-Saxon inhabitants of the British Isles.
This project is based on a digital edition of An Anglo-Saxon dictionary, based on the manuscript collections of the late Joseph Bosworth (the so called Main Volume, first edition 1898) and its Supplement (first edition 1921), edited by Joseph Bosworth and T. Northcote Toller, today the largest complete dictionary of Old English (one day to be hopefully supplanted by the DOE). Alistair Campbell's "enlarged addenda and corrigenda" from 1972 are not public domain and are therefore not part of the online dictionary. Please see the front & back matter of the paper dictionary for further information, prefaces and lists of references & contractions.
The digitization project was initiated by Sean Crist in 2001 as a part of his Germanic Lexicon Project and many individuals and institutions have contributed to this project. Check out the original GLP webpage and the old Bosworth-Toller offline application webpage (to be updated). Currently the project is hosted by the Faculty of Arts, Charles University.
In 2010, the data from the GLP were converted to create the current site. Care was taken to preserve the typography of the original dictionary, but also provide a modern, user friendly interface for contemporary users.
In 2013, the entries were structurally re-tagged and the original typography was abandoned, though the immediate access to the scans of the paper dictionary was preserved.
Our aim is to reach beyond a simple digital edition and create an online environment dedicated to all interested in Old English and Anglo-Saxon culture. Feel free to join in the editing of the Dictionary, commenting on its numerous entries or participating in the discussions at our forums.
We hope that by drawing the attention of the community of Anglo-Saxonists to our site and joining our resources, we may create a more useful tool for everybody. The most immediate project to draw on the corrected and tagged data of the Dictionary is a Morphological Analyzer of Old English (currently under development).
We are grateful for the generous support of the Charles University Grant Agency and for the free hosting at the Faculty of Arts at Charles University. The site is currently maintained and developed by Ondrej Tichy et al. at the Department of English Language and ELT Methodology, Faculty of Arts, Charles University in Prague (Czech Republic).
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.