Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test sets.
The English data includes manual annotations of English reference translations of Czech source texts. These texts were translated independently by two translators. After some necessary cleaning, 1000 segments were randomly selected for manual annotation. Both reference translations were annotated, giving 2000 annotated segments in total.
The Czech data includes manual annotations of Czech reference translations of English source texts. These texts were translated independently by three translators. After some necessary cleaning, 1000 segments were randomly selected for manual annotation. All three reference translations were annotated, giving 3000 annotated segments in total.
Faust is part of PDT-C 1.0 (http://hdl.handle.net/11234/1-3185).
This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).
Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
This package contains data sets for development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and medical experts. This work was supported by the EU FP7 project Khresmoi (European Commission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013).
We thank Health on the Net Foundation for granting the license for the English general public queries, TRIP database for granting the license for the English medical expert queries, and three anonymous translators and three medical experts for translating and revising the data.
This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish. The queries come from the general public and medical experts. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
This package contains data sets for development and testing of machine translation of sentences from summaries of medical articles between Czech, English, French, and German. This work was supported by the EU FP7 project Khresmoi (European Commission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013). We thank all the data providers and copyright holders for providing the source data and anonymous experts for translating the sentences.
This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
Mapping table for the article Hajič et al., 2024: Mapping Czech Verbal Valency to PropBank Argument Labels, in LREC-COLING 2024, as preprocessed by the algorithm described in the paper. This dataset is meant for verification (replication) purposes only. It will be manually processed further to arrive at a workable Czech PropBank, to be used in Czech UMR annotation and further updated during the annotation. The resulting PropBank frame files for Czech are expected to be available with some future releases of UMR, containing Czech UMR annotation, or separately.
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data was bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.1; April 2016) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter of these are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including in AMR-like syntax.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16,
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data was bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.2; January 2017) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter of these are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including in AMR-like syntax.
Version 1.1 was released in April 2016. Version 1.2 adds the 2015 Turku system, which was accidentally left out of version 1.1.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16,
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
The first edition of a speech corpus with a speech reconstruction layer (edited transcript).
The project of speech reconstruction of Czech and English was started at UFAL together with the PIRE project in 2005, and has gradually grown from ideas to a (first) annotation specification, annotation software, and actual annotation. It is part of the Prague Dependency Treebank family of annotated corpus resources and tools, to which it adds the spoken language layer(s). Grant numbers: LC536; MSM0021620838; IST-034344; ME838.