The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 10, consists of a montage of archive film material created to mark the 88th anniversary of the birth of the late President Tomáš Garrigue Masaryk.
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License, and released as open source software (available for download, including the source code, at http://code.google.com/p/onion/). It is being successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this piece of software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing n-grams of words of text. The author's algorithm has been shown to be more suitable for textual corpora deduplication than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it is able to detect even partially similar duplicates (50 %) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm were demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether. and PRESEMT, Lexical Computing Ltd
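The idea of comparing word n-grams can be illustrated with a minimal sketch. This is not onion's actual implementation: the function names, the whitespace tokenisation, the default n-gram length of 5 and the policy of remembering only the n-grams of kept documents are all assumptions made for illustration. A document is dropped when the share of its n-grams already seen in earlier kept documents reaches a threshold (0.5 for partial duplicates, 0.95 for near-identical ones).

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word n-grams in `text` (naive whitespace tokenisation)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(docs: Iterable[str], n: int = 5, threshold: float = 0.5) -> List[str]:
    """Keep a document unless at least `threshold` of its n-grams were already seen.

    Hypothetical helper, not part of onion. Documents shorter than `n` tokens
    have no n-grams and are always kept.
    """
    seen: Set[NGram] = set()
    kept: List[str] = []
    for doc in docs:
        grams = word_ngrams(doc, n)
        if grams:
            duplicated = len(grams & seen) / len(grams)
            if duplicated >= threshold:
                continue  # mostly made of material from earlier kept documents
        kept.append(doc)
        seen |= grams  # only kept documents contribute to the seen set
    return kept
```

With `threshold=0.5` a document that shares half of its n-grams with earlier text is removed; raising the threshold to 0.95 removes only near-identical copies.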
Omorfi is a free and open source project containing various tools and data for handling Finnish texts in a linguistically motivated manner. The main components of this repository are:
1) a lexical database containing hundreds of thousands of words (cf. lexical statistics),
2) a collection of scripts to convert the lexical database into formats used by upstream NLP tools (cf. lexical processing),
3) an autotools setup to build and install (or package, or deploy) the scripts, the database, and simple APIs / convenience processing tools, and
4) a collection of relatively simple APIs for a selection of languages and scripts to apply the NLP tools and access the database.
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data was bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.1; April 2016) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including an AMR-like syntax.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16,
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data was bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.2; January 2017) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including an AMR-like syntax.
Version 1.1 was released in April 2016. Version 1.2 adds the 2015 Turku system, which was accidentally left out of version 1.1.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16,
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
The segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel), issue no. 28A, B from 1944, was shot during the official opening of the Week of Czech Youth, organised by the Board of Trustees for the Education of Youth and held in the courtyard of Karlštejn Castle on 1 July. The ceremony was attended by Emanuel Moravec, Minister of Education and People's Enlightenment and Chairman of the Board, and by SS officer Ferdinand Fischer. František Teuner, General Secretary of the Board, spoke to the participants.
OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of jurisprudence with the help of open data and to help people without legal training to understand the justice system. The project is committed to the Open Data principles and the Free Access to Justice Movement.
The OpenLegalData dump of 2022-10-18 was used to create this corpus. The data was cleaned, automatically annotated (TreeTagger: POS and lemma) and grouped by metadata into files named by jurisdiction, BundeslandID and, if applicable, sub-corpus number. For example, Verwaltungsgerichtsbarkeit_11_05.cec6.gz has jurisdiction = administrative jurisdiction, BundeslandID = 11 and sub-corpus = 05. The data is randomly split into sub-corpora of 50 MB each.
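The file naming scheme described above can be unpacked with a small helper. This is only a sketch based on the single example file name given here: the function and class names are hypothetical, and the assumption that a file without a sub-corpus number simply omits the last part has not been checked against the actual corpus files.

```python
import re
from typing import NamedTuple, Optional


class SubCorpusName(NamedTuple):
    jurisdiction: str
    bundesland_id: int
    sub_corpus: Optional[int]  # None when the file name has no sub-corpus part


# Assumed pattern: <jurisdiction>_<BundeslandID>[_<sub-corpus>].cec6.gz
_NAME_PATTERN = re.compile(
    r"^(?P<jurisdiction>.+?)_(?P<land>\d+)(?:_(?P<sub>\d+))?\.cec6\.gz$"
)


def parse_subcorpus_name(filename: str) -> SubCorpusName:
    """Split a corpus file name into its metadata parts (hypothetical helper)."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    sub = match.group("sub")
    return SubCorpusName(
        jurisdiction=match.group("jurisdiction"),
        bundesland_id=int(match.group("land")),  # leading zeros are dropped
        sub_corpus=int(sub) if sub is not None else None,
    )
```

For instance, `parse_subcorpus_name("Verwaltungsgerichtsbarkeit_11_05.cec6.gz")` yields the jurisdiction `Verwaltungsgerichtsbarkeit`, BundeslandID 11 and sub-corpus 5.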
The corpus data is available in CEC6 format, which can be converted into many other corpus formats; use the CorpusExplorer software (www.CorpusExplorer.de) if necessary.