This package contains data sets for development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and medical experts. This work was supported by the EU FP7 project Khresmoi (European Commission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013).
We thank the Health on the Net Foundation for granting the license for the English general public queries, the TRIP database for granting the license for the English medical expert queries, and three anonymous translators and three medical experts for translating and revising the data.
This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish. The queries come from the general public and medical experts. This is version 2.0, which extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
This package contains data sets for development and testing of machine translation of sentences from summaries of medical articles between Czech, English, French, and German. This work was supported by the EU FP7 project Khresmoi (European Commission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013). We thank all the data providers and copyright holders for providing the source data and anonymous experts for translating the sentences.
This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
An interactive web demo for querying selected ÚFAL and LINDAT corpora. LINDAT/CLARIN KonText is a fork of ÚČNK KonText (https://github.com/czcorpus/kontext, maintained by Tomáš Machálek) with some modifications and additional features. KonText, in turn, is a fork of the Bonito 2.68 Python web interface to the corpus management tool Manatee (http://nlp.fi.muni.cz/trac/noske, created by Pavel Rychlý).
Korektor is a statistical spell-checker and (occasionally) grammar-checker. It is released under the 2-Clause BSD license (http://opensource.org/licenses/BSD-2-Clause).
Korektor started with Michal Richter's diploma thesis Advanced Czech Spellchecker (https://redmine.ms.mff.cuni.cz/documents/1), but it is being developed further. There are two versions: a command line utility (tested on Linux, Windows and OS X) and a REST service with a publicly available API (http://lindat.mff.cuni.cz/services/korektor/api-reference.php) and an HTML front end (https://lindat.mff.cuni.cz/services/korektor/).
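As a rough illustration of how the REST service might be called from a script, the following Python sketch builds a request URL and, optionally, sends it. The base URL, the `correct` method name, the `data` and `model` parameters, and the `result` field of the JSON response are assumptions that should be checked against the API reference linked above; the helper names `build_correct_url` and `correct` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public Korektor REST service; verify it in the API reference.
KOREKTOR_API = "https://lindat.mff.cuni.cz/services/korektor/api"


def build_correct_url(text, model=None):
    """Build a GET URL for the (assumed) 'correct' method.

    'data' and 'model' are assumed parameter names; 'model' is left out
    when no model is given so that the service can pick its default.
    """
    params = {"data": text}
    if model is not None:
        params["model"] = model
    return f"{KOREKTOR_API}/correct?{urlencode(params)}"


def correct(text, model=None, timeout=10):
    """Send the request and return the 'result' field of the JSON reply.

    Requires network access; the response layout is an assumption.
    """
    with urlopen(build_correct_url(text, model), timeout=timeout) as response:
        return json.load(response).get("result")
```

Calling `build_correct_url("ahoj svete")` only constructs the URL and needs no network access, which makes it easy to inspect the request before sending it with `correct`.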
"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with their dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity and automatic translation of the original Persian sentences into five different languages (EN, CS, DE, IT, HI).
Lingua::Interset is a universal morphosyntactic feature set to which all tagsets of all corpora/languages can be mapped. Version 2.026 covers 37 different tagsets of 21 languages. Limited support of the older drivers for other languages (which are not included in this package but are available for download elsewhere) is also available; these will be fully ported to Interset 2 in the future.
Interset is implemented as Perl libraries. It is also available via CPAN.
This toolkit comprises the tools and supporting scripts for unsupervised induction of dependency trees from raw texts or texts with already assigned part-of-speech tags. There are also scripts for simple machine translation based on unsupervised parsing and scripts for minimally supervised parsing into Universal-Dependencies style.
The collection comprises the relevance judgments used in the 2023 LongEval Information Retrieval Lab (https://clef-longeval.github.io/), organized at CLEF. It consists of three sets of relevance judgments:
1) Relevance judgments for the heldout queries from the LongEval Train Collection (http://hdl.handle.net/11234/1-5010).
2) Relevance judgments for the short-term persistence (sub-task A) queries from the LongEval Test Collection (http://hdl.handle.net/11234/1-5139).
3) Relevance judgments for the long-term persistence (sub-task B) queries from the LongEval Test Collection (http://hdl.handle.net/11234/1-5139).
These judgments were provided by the Qwant search engine (https://www.qwant.com) and were generated using a click model. The click model output was based on the clicks of Qwant's users, but it mitigates noise from raw user clicks caused by positional bias and also better safeguards users' privacy. Consequently, it can serve as a reliable soft relevance estimate for evaluating and training models.
The collection includes a total of 1,420 judgments for the heldout queries, with 74 considered highly relevant and 326 deemed relevant. For the short-term sub-task queries, there are 12,217 judgments, including 762 highly relevant and 2,608 relevant ones. As for the long-term sub-task queries, there are 13,467 judgments, with 936 being highly relevant and 2,899 relevant.
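The counts above can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic: the numbers are copied from the description, while treating every judgment that is neither "highly relevant" nor "relevant" as a single remaining group is an assumption, since the description does not name that group. The names `JUDGMENTS`, `summarize`, `SUMMARY`, and `TOTAL_JUDGMENTS` are illustrative.

```python
# Counts copied from the description: (total, highly relevant, relevant).
JUDGMENTS = {
    "heldout": (1420, 74, 326),
    "short-term": (12217, 762, 2608),
    "long-term": (13467, 936, 2899),
}


def summarize(total, highly_relevant, relevant):
    """Return the positive count, the remainder, and the positive share."""
    positive = highly_relevant + relevant
    return {
        "positive": positive,
        # Assumption: everything not listed as (highly) relevant forms one group.
        "remaining": total - positive,
        "positive_share": positive / total,
    }


SUMMARY = {name: summarize(*counts) for name, counts in JUDGMENTS.items()}
TOTAL_JUDGMENTS = sum(total for total, _, _ in JUDGMENTS.values())

if __name__ == "__main__":
    for name, stats in SUMMARY.items():
        print(
            f"{name}: {stats['positive']} positive, "
            f"{stats['remaining']} remaining, "
            f"share {stats['positive_share']:.3f}"
        )
    print(f"all sets: {TOTAL_JUDGMENTS} judgments")
```

Under these assumptions, between roughly 27% and 29% of the judgments in each set are marked relevant or highly relevant.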