Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Changes in version 1.1:
1. Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset.
2. SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0.
3. Balto-Slavic languages, Germanic languages and Romance languages were tagged by classifier trained only on the respective group of languages. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
This software package includes three tools: web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, API server and a tool for translation of documents with markup (html, docx, odt, pptx, odp,...). These tools are used in the Charles Translator service (https://translator.cuni.cz).
This software was developed within the EdUKate project, which aims to help mitigate language barriers between non-Czech-speaking children in the Czech Republic and the education in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.
This package contains data used in the IWPT 2020 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal Dependencies release 2.5 (http://hdl.handle.net/11234/1-3105) but some treebanks contain additional enhanced annotations. Moreover, not all of these additions became part of Universal Dependencies release 2.6 (http://hdl.handle.net/11234/1-3226), which makes the shared task data unique and worth a separate release to enable later comparison with new parsing algorithms. The package also contains a number of Perl and Python scripts that have been used to process the data during preparation and during the shared task. Finally, the package includes the official primary submission of each team participating in the shared task.
This package contains data used in the IWPT 2021 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal Dependencies release 2.7 (http://hdl.handle.net/11234/1-3424) but some treebanks contain additional enhanced annotations. Moreover, not all of these additions became part of Universal Dependencies release 2.8 (http://hdl.handle.net/11234/1-3687), which makes the shared task data unique and worth a separate release to enable later comparison with new parsing algorithms. The package also contains a number of Perl and Python scripts that have been used to process the data during preparation and during the shared task. Finally, the package includes the official primary submission of each team participating in the shared task.