A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C) is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme. PDT-C includes the following PDT corpora: the Prague Dependency Treebank (the original PDT contents: written newspaper and journal texts from three genres); the Czech part of the Prague Czech-English Dependency Treebank (financial texts translated from English); the Prague Dependency Treebank of Spoken Czech (spoken data, including audio, transcripts, and multiple speech reconstruction annotation); and PDT-Faust (user-generated texts). PDT-C differs from the separately published original treebanks as follows: it is published as one package, which makes all the datasets easier to handle; the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed; and a common valency lexicon for all four original parts is enclosed. The documentation provides two desktop tools for browsing and editing (TrEd and MEd), and the corpus is also available online for searching using PML-TQ.
The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, a new structure of the basic documentation and a new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 sentences (832,823 words) annotated on all layers, from tectogrammatical annotation to syntax to morphology. Additional sentences are annotated for syntax and morphology only; the totals for the lower layers of annotation are 87,913 sentences with 1,502,976 words at the analytical layer (surface dependency syntax) and 115,844 sentences with 1,956,693 words at the morphological layer (these totals include the sentences that are also annotated at the higher layers). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.
The latinpipe-evalatin24-240520 is a PhilBerta-based model for LatinPipe 2024 <https://github.com/ufal/evalatin2024-latinpipe>, performing tagging, lemmatization, and dependency parsing of Latin, based on the winning entry to the EvaLatin 2024 <https://circse.github.io/LT4HALA/2024/EvaLatin> shared task. It is released under the CC BY-NC-SA 4.0 license.
UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. UDPipe is language-agnostic and can be trained given only annotated data in CoNLL-U format. Trained models are provided for nearly all UD treebanks. UDPipe is available as a binary, as a library for C++, Python, Perl, Java, C#, and as a web service.
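Since both the input and the output of the pipeline are CoNLL-U files, it may help to see how such a file can be read. The following is a minimal sketch in plain Python, not part of UDPipe itself: the names `Token`, `parse_conllu` and `SAMPLE` are hypothetical, and the sample sentence is hand-written for illustration rather than actual UDPipe output. It relies only on the CoNLL-U convention of ten tab-separated columns per token line, `#` comment lines, and a blank line after each sentence.

```python
from dataclasses import dataclass
from typing import List

# The ten tab-separated CoNLL-U columns, in their standard order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


@dataclass
class Token:
    """One token line of a CoNLL-U file; all columns are kept as strings."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token objects."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line terminates the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}: {line!r}")
        current.append(Token(*columns))
    if current:
        # Tolerate a missing blank line after the last sentence.
        sentences.append(current)
    return sentences


# Hand-written illustrative input (an assumption, not real UDPipe output).
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token.id, token.form, token.lemma, token.upos, token.head, token.deprel)
```

The sketch keeps every column as a string on purpose: the ID column can also hold ranges for multiword tokens and decimal values for empty nodes, so converting it to an integer would need extra handling that is left out here.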
UDPipe is free software under the Mozilla Public License 2.0 (http://www.mozilla.org/MPL/2.0/), and the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA (http://creativecommons.org/licenses/by-nc-sa/4.0/) license, although for some models the original data used to create the model may impose additional licensing conditions. UDPipe is versioned using Semantic Versioning (http://semver.org/).
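Because the versioning follows Semantic Versioning, a requirement such as "version at least 1.2" can be checked by comparing the numeric components rather than the raw strings. The sketch below is only an illustration in plain Python: the helper names `parse_version` and `satisfies_minimum` are hypothetical, how the installed version string is obtained is left out, and pre-release or build suffixes are simply ignored, which is a simplification of the full Semantic Versioning precedence rules.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR' or 'MAJOR.MINOR.PATCH' into a tuple of integers.

    Any pre-release or build suffix after the numeric part is ignored
    (a simplifying assumption of this sketch).
    """
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"not a recognizable version string: {version!r}")
    major, minor, patch = match.groups()
    # A missing PATCH component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def satisfies_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum version."""
    # Tuples compare component by component, so 1.10.0 ranks above 1.2.0.
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("1.1.0", "1.2.0", "1.10.0"):
        print(candidate, satisfies_minimum(candidate, "1.2"))
```

Comparing integer tuples avoids the usual pitfall of string comparison, where "1.10.0" would sort before "1.2.0".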
The UDPipe website http://ufal.mff.cuni.cz/udpipe contains download links for both the released packages and the trained models, hosts the documentation, and offers an online demo.
The UDPipe development repository http://github.com/ufal/udpipe is hosted on GitHub.
Tokenizer, POS Tagger, Lemmatizer and Parser models for all Universal Dependencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548).
To use these models, you need the UDPipe binary, which you can download from http://ufal.mff.cuni.cz/udpipe.
Tokenizer, POS Tagger, Lemmatizer and Parser models for all 50 languages of Universal Dependencies 2.0 Treebanks, created solely using UD 2.0 data (http://hdl.handle.net/11234/1-1983). The model documentation including performance can be found at http://ufal.mff.cuni.cz/udpipe/users-manual#universal_dependencies_20_models .
To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe .
In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.
Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Dependencies 2.10 Treebanks, created solely using UD 2.10 data (https://hdl.handle.net/11234/1-4758). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_210_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Dependencies 2.12 Treebanks, created solely using UD 2.12 data (https://hdl.handle.net/11234/1-5150). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_212_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Tokenizer, POS Tagger, Lemmatizer and Parser models for 84 treebanks of 56 languages of Universal Dependencies 2.3 Treebanks, created solely using UD 2.3 data (http://hdl.handle.net/11234/1-2895). The model documentation including performance can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_23_models .
To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe .
In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.
Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Dependencies 2.4 Treebanks, created solely using UD 2.4 data (http://hdl.handle.net/11234/1-2988). The model documentation including performance can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_24_models .
To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe .
In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.