Relationship extraction models for the Czech language. Models are trained on CERED (dataset created by distant supervision on Czech Wikipedia and Wikidata) and recognize a subset of Wikidata relations (listed in CEREDx.LABELS).
We supply a demo.py that performs inference on user-defined input and requirements.txt file for pip. Adapt the demo code to use the model.
Both the dataset and the models are presented in Relationship Extraction thesis.
Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with word embeddings of dimension 100 computed from lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/).
For each language, automatic annotations in CoNLL-U format are provided in a separate archive. The word embeddings for all languages are distributed in one archive.
Note that the CC BY-SA-NC 4.0 license applies to the automatically generated annotations and word embeddings, not to the underlying data, which may have different license and impose additional restrictions.
Update 2018-09-03
===============
Added data in the 4 “surprise languages” from the 2017 ST: Buryat, Kurmanji, North Sami and Upper Sorbian. This has been promised before, during CoNLL-ST 2018 we gave the participants a link to this record saying the data was here. It wasn't, sorry. But now it is.
Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from MorfFlex CZ 161115 and DeriNet 1.2 and the PoS tagger is trained on Prague Dependency Treebank 3.0 (PDT). and This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The Czech morphologic system was devised by Jan Hajič.
The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.
The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The tagger is trained on morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from MorfFlex CZ 2.0, DeriNet 2.1 and the PoS tagger is trained on Prague Dependency Treebank - Consolidated 1.0. and This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The Czech morphologic system was devised by Jan Hajič.
The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.
The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The tagger is trained on morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
The dataset contains delimitation of borders of dialect regions, subgroups, areas and types in the Czech Republic. It is the result of an extensive expert revision that was based on various sources and made the delimitation exact and accurate. At the same time, the dataset corresponds to the underlying data of the Mapka application running at https://korpus.cz/mapka/
There are four files in this submission. Two files contain the delimitation of dialect regions ("oblasti"; both in GeoJSON and Shapefile formats) and two files contain the delimitation of smaller dialect areas, i.e. subgroups, areas and types ("oblasti_jemne"; again in GeoJSON and Shapefile formats).
English model for NameTag, a named entity recognition tool. The model is trained on CoNLL-2003 training data. Recognizes PER, ORG, LOC and MISC named entities. Achieves F-measure 84.73 on CoNLL-2003 test data.
The FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5) data - a Czech mutation of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text data). The model has the same architecture as the original BERT model, i.e. 12 transformation blocks, 12 attention heads and the hidden size of 768 neurons. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of the Google’s internal WordPiece tokenization.
More details can be found in README.txt. Yet more detailed description is available in https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
The book [1] contains spelling rules classified into ten categories, each category containing many rules. This XML file presents our implemented rules classified with six category tags, as is the case in the book. We implemented 24 rules since the remaining rules require diacritical and morphological analysis that are outside the scope of our present work.
References:
[1] Dr.Fahmy Al-Najjar, 'Spelling rules in ten easy lessons', Al Kawthar Library,2008. Available: https://www.alukah.net/library/0/53498/%D9%82%D9%88%D8%A7%D8%B9%D8%AF-%D8%A7%D9%84%D8%A5%D9%85%D9%84%D8%A7%D8%A1-%D9%81%D9%8A-%D8%B9%D8%B4%D8%B1%D8%A9-%D8%AF%D8%B1%D9%88%D8%B3-%D8%B3%D9%87%D9%84%D8%A9-pdf/
NER models for NameTag 2, named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation including performance can be found here: https://ufal.mff.cuni.cz/nametag/2/models . These models are for NameTag 2, named entity recognition tool, which can be found here: https://ufal.mff.cuni.cz/nametag/2 .
NER models for NameTag 2, named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation including performance can be found here: https://ufal.mff.cuni.cz/nametag/2/models . These models are for NameTag 2, named entity recognition tool, which can be found here: https://ufal.mff.cuni.cz/nametag/2 .