Search Results
202. CERED baseline models
- Creator:
- Šimečková, Zuzana and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- mlmodel, text, and languageDescription
- Subject:
- relationship extraction
- Language:
- Czech
- Description:
- Relationship extraction models for the Czech language. The models are trained on CERED (a dataset created by distant supervision on Czech Wikipedia and Wikidata) and recognize a subset of Wikidata relations (listed in CEREDx.LABELS). We supply a demo.py that performs inference on user-defined input and a requirements.txt file for pip. Adapt the demo code to use the model. Both the dataset and the models are presented in the Relationship Extraction thesis.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/
203. Ceremonial unveiling of the Ernst Denis Memorial
- Creator:
- Deglové
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- pomník Denis Ernest maketa, pomník Denis Ernest odhalení, stavba podstavce pomníku, sochař při práci, Sokolové na shromáždění slavnostním, shromáždění slavnostní, ateliér sochařský, generál československý, generál francouzský, projev veřejný, pomník Dvořák Karel, výročí vznik ČSR 10., Vznik ČSR, Places::Praha::Malá Strana::Malostranské náměstí::pomník Ernsta Denise, Places::Praha::Janáčkovo nábřeží::ateliér Karla Dvořáka /int./, People::Hodža Milan (1878-1944), People::Baxa Karel (1862-1938), People::Syrový Jan (1888-1970), People::Beneš Edvard (1884-1948), and People::Masaryk Tomáš Garrigue (1850-1937)
- Language:
- No linguistic content
- Description:
- The segment captures events preceding the installation and subsequent unveiling of the memorial statue of French historian and Slavonic scholar Ernest Denis on Lesser Town Square in Prague. Members of the Commission for the Construction of Denis's Memorial use a maquette to find the best place for the statue. The event is witnessed by the artist, the sculptor Karel Dvořák. A shot of Dvořák in his studio in the courtyard of a house on Janáček Embankment in Smíchov, Prague. Digging works for the pedestal are followed by the unveiling of the memorial on 27 October 1928, the eve of the tenth anniversary of the Czechoslovak Republic. President Tomáš Garrigue Masaryk, Minister of Education Milan Hodža, Prague Mayor Karel Baxa, General Jan Syrový, MP Antonín Uhlíř, French General Eugene Mittelhauser, French politician Alfred Oberkirch and others are present on the grandstand. Speech by Minister of Foreign Affairs Edvard Beneš. An image of President T. G. Masaryk and Edvard Beneš.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
204. Česílko
- Creator:
- Hajič, Jan, Kuboň, Vladislav, and Homola, Petr
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- toolService
- Subject:
- machine translation and Czech-Slovak translation
- Language:
- Czech
- Description:
- Česílko is a tool enabling fast and efficient translation from one source language into many mutually related target languages.
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
205. Čestmír Loukotka (philologist)
- Creator:
- Veselý, Bohumil
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- Galerie osobností, Places::Praha::Nové Město::Školská::pavlač Bohumila Veselého, and People::Loukotka Čestmír (1895-1966)
- Language:
- No linguistic content
- Description:
- Philologist Čestmír Loukotka on Bohumil Veselý's balcony.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
206. Chared
- Creator:
- Pomikálek, Jan
- Publisher:
- Masaryk University, NLP Centre
- Type:
- toolService and tool
- Subject:
- character encoding, character encoding detection, charset, and unicode
- Language:
- English
- Description:
- Chared is a software tool that can detect the character encoding of a text document, provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages using a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than other character encoding detection tools with no language constraints. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages consisting of 70 billion tokens altogether. Chared is open source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this piece of software was published in POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011. pp. 125-129 (5 pages). ISBN 978-80-263-0077-9. and PRESEMT, Lexical Computing Ltd
- Rights:
- BSD 3-Clause "New" or "Revised" license, http://opensource.org/licenses/BSD-3-Clause, and PUB
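The Chared description above says detection is based on the similarity of byte trigram vectors. The following is a minimal sketch of that general idea, not Chared's actual implementation: the function names (`byte_trigrams`, `cosine`, `build_models`, `detect`), the use of cosine similarity, and the way models are built from a single sample string are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def byte_trigrams(data: bytes) -> Counter:
    """Count overlapping 3-byte sequences in a byte string."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_models(sample_text: str, encodings: list) -> dict:
    """Hypothetical 'language model': one trigram vector per candidate
    encoding, built by encoding the same sample text in each of them."""
    return {enc: byte_trigrams(sample_text.encode(enc)) for enc in encodings}


def detect(data: bytes, models: dict) -> str:
    """Return the candidate encoding whose model is most similar to the data."""
    doc = byte_trigrams(data)
    return max(models, key=lambda enc: cosine(doc, models[enc]))


# Usage: a tiny Czech sample stands in for a real training corpus.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy."
models = build_models(sample, ["utf-8", "cp1250", "iso-8859-2"])
print(detect("žluťoučký kůň".encode("cp1250"), models))
```

Knowing the language in advance is what makes the per-encoding models meaningful: the same text yields different byte patterns under each encoding, so the unknown document can be matched against each pattern in turn.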
207. Children Playing in Gas Mask in Prague
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- děti v plynových maskách, masky plynové, koloběžka, tříkolka, hry dětské v maskách, Mnichovská dohoda, and Český zvukový týdeník Aktualita::1938/33
- Language:
- Czech
- Description:
- The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 33 shows children playing in gas masks in Prague.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
208. Christmas exhibition in the hall of the Black Rose Palace
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- stromek vánoční, hvězda vánoční, znak Kuratorium pro výchovu mládeže, akce Kuratorium pro výchovu mládeže, Kuratorium pro výchovu mládeže akce, dárky vánoční pro děti, dívky v krojích, kroje lidové, hračky rozdávání, Kuratorium, Places::Praha::Nové Město:: Na Příkopě::palác Černá růže /int./, People::Moravec Emanuel (1893-1945), People::Teuner František (1911-1978), and Český zvukový týdeník Aktualita::1944
- Language:
- Czech
- Description:
- Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 1A from 1944 was shot during a Christmas exhibition organised by the Board of Trustees for the Education of Youth and held in the hall of the Black Rose Palace in Na Příkopě Street in Prague from 18 to 22 December. The exhibition included a display of the 500 prettiest toys made as part of the Sewing Dolls initiative. Girls made 59,000 dolls, out of which 44,000 went to the children of the labourers working in the Reich and 15,000 to the children of the German soldiers fighting on the front. The exhibition was toured by Minister of Education and People's Enlightenment and Chairman of the Board Emanuel Moravec and General Secretary of the Board František Teuner.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
209. CLEF-TREC Q/A
- Creator:
- Abouenour, Lahcen, Bouzoubaa, Karim, and Rosso, Paolo
- Publisher:
- ALELM
- Type:
- text, other, and lexicalConceptualResource
- Subject:
- CLEF and TREC
- Language:
- Arabic
- Description:
- A list of 2264 questions and answers from CLEF and TREC, translated into Arabic.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
210. CoCzeFLA Chroma 2022.07
- Creator:
- Chromá, Anna and Matiasovitsová, Klára
- Publisher:
- Faculty of Arts, Charles University
- Type:
- text and corpus
- Subject:
- first language acquisition, typical development, and monolingual corpus
- Language:
- Czech
- Description:
- Transcripts of longitudinal audio recordings of 7 typically developing monolingual Czech children aged 1;7 to 3;9. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the pseudonym of the child and their age at the given session in the form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
211. CoCzeFLA Chroma 2023.04
- Creator:
- Chromá, Anna, Matiasovitsová, Klára, Sláma, Jakub, and Treichelová, Jolana
- Publisher:
- Charles University, Faculty of Arts
- Type:
- text and corpus
- Subject:
- first language acquisition, typical development, longitudinal corpus, and Czech
- Language:
- Czech
- Description:
- A new version of the previously published corpus Chroma. Version 2023.04 includes six children. Two transcripts (Julie20221, Klara30424) were removed since they did not meet the criteria on the dialogical format. The transcripts were revised (eliminating typing errors and inconsistencies in the transcription format) and morphologically annotated by the automatic tool MorphoDiTa. A detailed manual check of the annotation was performed on children's utterances; the annotation of adult data has not been checked yet. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in the form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
212. CoCzeFLA Chroma 2023.07
- Creator:
- Chromá, Anna, Sláma, Jakub, Matiasovitsová, Klára, and Kohoutková, Jolana
- Publisher:
- Charles University, Faculty of Arts
- Type:
- text and corpus
- Subject:
- first language acquisition, typical development, longitudinal corpus, and Czech
- Language:
- Czech
- Description:
- A new version of the previously published corpus Chroma with morphological annotation. Version 2023.07 differs from 2023.04 in that it includes all seven children and went through an additional careful check of consistency and conformity to the CHAT transcription principles. Two transcripts (Julie20221, Klara30424) from the previous versions (2022.07, 2019.07) were removed since they did not meet our criteria on dialogical format. All transcripts of recordings made during one day were merged into one file. Thus, version 2023.07 consists of 183 files/transcripts. The number of utterances and tokens given here in LINDAT corresponds to children's lines only. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in the form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
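The CoCzeFLA Chroma records above describe file names made of the child's alias followed by the age in the form YMMDD, with Julie20221 and Klara30424 given as examples. A minimal sketch of splitting such a name into its parts follows; the helper name `parse_session_name` is hypothetical, and the sketch assumes the name is passed without any file extension and that the alias contains no digits.

```python
import re
from typing import NamedTuple


class Session(NamedTuple):
    alias: str
    years: int
    months: int
    days: int


# Alias (no digits), then Y (1 digit), MM (2 digits), DD (2 digits).
_NAME = re.compile(r"^(?P<alias>\D+)(?P<y>\d)(?P<mm>\d{2})(?P<dd>\d{2})$")


def parse_session_name(name: str) -> Session:
    """Split an alias+YMMDD session name into alias and age components."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not an alias+YMMDD name: {name!r}")
    return Session(
        match["alias"], int(match["y"]), int(match["mm"]), int(match["dd"])
    )


# Usage with a name quoted in the record description.
print(parse_session_name("Julie20221"))
```

Under this reading, Julie20221 corresponds to the child with the alias Julie at 2 years, 2 months and 21 days.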
213. Come Forward and Give Blood!
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- krev odběr, krev darování, agitace dárcovství krve, Československý červený kříž, sestry Červeného kříže, konzervy krevní, akce Československý Červený kříž, Mnichovská dohoda, and Československý zvukový týdeník Aktualita::1938/39
- Language:
- Czech
- Description:
- The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 39 appeals to the public to donate blood in preparation for the expected military conflict. It includes illustrative shots of how donated blood is preserved for use in the combat environment. The report includes information about different blood groups and how healthy blood donors are tested by the Czechoslovak Red Cross.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
214. Commemorative ceremony for the head of RAD in the Protectorate...
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- pohřeb Commichau Alexandr, akt pietní Commichau Alexandr, vyznamenání vojenská, projev smuteční, kříže hákové, rakev na katafalku, průvod pohřební, hajlování, znak Říšská pracovní služba, věnec smuteční kladení, oddíly Říšká pracovní služba na pohřbu, vůz pohřební, vojáci němečtí, Významné pohřby, Places::Praha::Hradčany::Pražský hrad::Španělský sál, Places::Praha::Hradčany::Pražský hrad::první hradní nádvoří, Places::Praha::Hradčany::Pražský hrad::Matyášova brána, People::Decker Wilhelm (1899-1945), People::Commichau Margarethe (1910-), People::Frank Karl Hermann (1898-1946), People::Daluege Kurt (1897-1946), and Český zvukový týdeník Aktualita::1942/41A
- Language:
- Czech
- Description:
- Segment of the Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) 1942 No. 41 captures the funeral of Alexandr Commichau, head of the Reich Labour Service, held on 5 October 1942 in the Spanish Hall of Prague Castle, decorated with Nazi emblems for the occasion. The deceased's military honours are on display next to a bier with the coffin. Mourners include the widow, a little girl and State Secretary Hermann Frank. Acting Reich Protector Kurt Daluege lays down a wreath from Adolf Hitler. The funeral speech is delivered by Deputy Chief General of RAD Wilhelm Decker (silent). The procession with the coffin, which will be transported to a crematorium in Prague, moves through the Matthias Gate.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
215. Concert to Honour State President Emil Hácha
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- akce Kuratorium pro výchovu mládeže, Kuratorium pro výchovu mládeže akce, koncert pěvecký, orchestr Česká filharmonie, výročí Hácha Emil prezident 5., sbor pěvecký Kühnův dětský sbor, sbor pěvecký Český pěvecký sbor, lóže divadelní, sbormistr, dirigent, vlajka s hákovým křížem, Kuratorium, Places::Praha::Staré Město::náměstí Republiky::Obecní dům::Smetanova síň, and Český zvukový týdeník Aktualita::1943/49
- Language:
- Czech
- Description:
- Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 49AB from 1943 captures the concert called Five Years Leading the Nation, which was organised by the Board of Trustees for the Education of Youth to mark the fifth anniversary of Emil Hácha's presidency, and held at Smetana Hall in the Municipal House in Prague on 29 November. General Secretary of the Board František Teuner gave a speech at the formal event. The programme included a selection of folk songs by Otakar Jeremiáš performed by the Czech Choir and the Kühn Children's Choir accompanied by the Czech Philharmonic under the baton of Karel Šejna.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
216. Concert to Mark the 70th Birth Anniversary of the Late Josef Suk
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- akce Kuratorium pro výchovu mládeže, Kuratorium pro výchovu mládeže akce, orchestr symfonický, orchestr Česká filharmonie, dirigent orchestru symfonického, mládež na koncertě, sál koncertní vyzdobený, vlajka s hákovým křížem, vlajka česká protektorátní, mládež tleskající, orlice říšskoněmecká, Kuratorium, Places::Praha::Staré Město::náměstí Republiky::Obecní dům::Smetanova síň, People::Pařík Otakar (1901-1955), and Český zvukový týdeník Aktualita::1944/42AB
- Language:
- Czech
- Description:
- Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 42A from 1944 was shot during a concert organised by the Board of Trustees for the Education of Youth and held in the Smetana Hall of the Municipal House in Prague on 3 October. The concert marked the 70th anniversary of the birth of the late composer Josef Suk. The programme, prepared by the Czech Philharmonic led by conductor Otakar Pařík, included the symphonic poem "Praga".
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
217. CoNLL 2009 Shared Task - Czech Data
- Creator:
- Hajič, Jan, Straňák, Pavel, and Štěpánek, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- conll-st and treebank
- Language:
- Czech
- Description:
- Czech data - both train and test+eval sets, as well as the valency dictionary - for the CoNLL 2009 Shared Task. Documentation is included. The data are generated from PDT 2.0. LDC catalog number: LDC2009E34B and MSM 0021620838 (http://ufal.mff.cuni.cz:8080/bib/?section=grant&id=116488695895567&mode=view)
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
218. CoNLL 2009 Shared Task Czech Trial Set
- Creator:
- Hajič, Jan, Straňák, Pavel, and Štěpánek, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- conll-st
- Language:
- Czech
- Description:
- Czech trial (example) data for CoNLL 2009 Shared Task. The data are generated from PDT 2.0. LDC2009E32B and MSM 0021620838 (http://ufal.mff.cuni.cz:8080/bib/?section=grant&id=116488695895567&mode=view)
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
219. CoNLL 2017 and 2018 Shared Task Blind and Preprocessed Test Data
- Creator:
- Zeman, Daniel and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- tokenization, word segmentation, morphology, tagging, syntax, parsing, and universal dependencies
- Language:
- Afrikaans, Arabic, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Persian, Finnish, French, Old French (842-ca. 1400), Irish, Galician, Gothic, Ancient Greek (to 1453), Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Latin, Latvian, Dutch, Norwegian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Thai, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, and Chinese
- Description:
- CoNLL 2017 and 2018 shared tasks: Multilingual Parsing from Raw Text to Universal Dependencies. This package contains the test data in the form in which they were presented to the participating systems: raw text files and files preprocessed by UDPipe. The metadata.json files contain lists of files to process and to output; README files in the respective folders describe the syntax of metadata.json. For full training, development and gold standard test data, see Universal Dependencies 2.0 (CoNLL 2017) and Universal Dependencies 2.2 (CoNLL 2018); the download links are at http://universaldependencies.org/. For more information on the shared tasks, see http://universaldependencies.org/conll17/ and http://universaldependencies.org/conll18/. Contents: conll17-ud-test-2017-05-09 (CoNLL 2017 test data); conll18-ud-test-2018-05-06 (CoNLL 2018 test data); conll18-ud-test-2018-05-06-for-conll17 (CoNLL 2018 test data with metadata and filenames modified so that it is digestible by the 2017 systems).
- Rights:
- Licence Universal Dependencies v2.2, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2, and PUB
220. CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings
- Creator:
- Ginter, Filip, Hajič, Jan, Luotolahti, Juhani, Straka, Milan, and Zeman, Daniel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, mlmodel, and languageDescription
- Subject:
- CoNLL 2017, word embeddings, and automatic annotation
- Language:
- Multiple languages
- Description:
- Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with word embeddings of dimension 100 computed from lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/). For each language, automatic annotations in CoNLL-U format are provided in a separate archive. The word embeddings for all languages are distributed in one archive. Note that the CC BY-NC-SA 4.0 license applies to the automatically generated annotations and word embeddings, not to the underlying data, which may have a different license and impose additional restrictions. Update 2018-09-03: Added data in the 4 “surprise languages” from the 2017 ST: Buryat, Kurmanji, North Sami and Upper Sorbian. This had been promised before; during CoNLL-ST 2018 we gave the participants a link to this record saying the data was here. It wasn't, sorry. But now it is.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
221. CoNLL 2017 Shared Task - UDPipe Baseline Models and Supplementary Materials
- Creator:
- Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, mlmodel, and languageDescription
- Subject:
- CoNLL 2017, tokenizer, POS tagger, lemmatization, tagger, parser, dependency parser, morphology, and treebank
- Language:
- Multiple languages
- Description:
- Baseline UDPipe models for CoNLL 2017 Shared Task in UD Parsing, and supplementary material. The models require UDPipe version 1.1 or later and are evaluated using the official evaluation script. The models are trained on a slightly different split of the official UD 2.0 CoNLL 2017 training data, the so-called baselinemodel split, in order to allow comparison of models even during the shared task. This baselinemodel split of UD 2.0 CoNLL 2017 training data is available for download. Furthermore, we also provide UD 2.0 CoNLL 2017 training data with automatically predicted morphology. We utilize the baseline models on development data and perform 10-fold jack-knifing (each fold is predicted with a model trained on the rest of the folds) on the training data. Finally, we supply all required data and hyperparameter values needed to replicate the baseline models.
- Rights:
- Licence Universal Dependencies v2.0, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0, and PUB
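The record above describes 10-fold jack-knifing: each fold of the training data is predicted with a model trained on the remaining folds. A minimal generic sketch of that procedure follows. It is not the UDPipe training pipeline; the `train` and `predict` callables and the round-robin fold assignment are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def jackknife(
    data: Sequence,
    train: Callable[[List], object],
    predict: Callable[[object, object], object],
    k: int = 10,
) -> List:
    """Predict every item with a model that never saw it during training.

    Items are assigned to folds round-robin (item i goes to fold i % k).
    For each fold, a model is trained on all other folds and then used
    to predict the held-out items.
    """
    predictions = [None] * len(data)
    for fold in range(k):
        held_out = range(fold, len(data), k)
        held_set = set(held_out)
        rest = [item for i, item in enumerate(data) if i not in held_set]
        model = train(rest)
        for i in held_out:
            predictions[i] = predict(model, data[i])
    return predictions


# Usage with a toy "model" that simply remembers its training items:
# no item is ever predicted by a model that has seen it.
items = list(range(20))
seen = jackknife(items, train=set, predict=lambda model, x: x in model)
print(any(seen))
```

The point of the procedure is that automatically predicted annotations on the training data then have roughly the same error profile as predictions on unseen data, which matters when a downstream model is trained on them.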
222. CoNLL 2017 Shared Task System Outputs
- Creator:
- Zeman, Daniel, Potthast, Martin, Straka, Milan, Popel, Martin, Dozat, Timothy, Qi, Peng, Manning, Christopher, Shi, Tianze, Wu, Felix G., Chen, Xilun, Cheng, Yao, Björkelund, Anders, Falenska, Agnieszka, Yu, Xiang, Kuhn, Jonas, Che, Wanxiang, Guo, Jiang, Wang, Yuxuan, Zheng, Bo, Zhao, Huaipeng, Liu, Yang, Teng, Dechuan, Liu, Ting, Lim, Kyungtae, Poibeau, Thierry, Sato, Motoki, Manabe, Hitoshi, Noji, Hiroshi, Matsumoto, Yuji, Kırnap, Ömer, Önder, Berkay Furkan, Yuret, Deniz, Straková, Jana, Vania, Clara, Zhang, Xingxing, Lopez, Adam, Heinecke, Johannes, Asadullah, Munshi, Kanerva, Jenna, Luotolahti, Juhani, Ginter, Filip, Kuan, Yu, Sofroniev, Pavel, Schill, Erik, Hinrichs, Erhard, Nguyen, Dat Quoc, Dras, Mark, Johnson, Mark, Qian, Xian, Vilares, David, Gómez-Rodríguez, Carlos, Aufrant, Lauriane, Wisniewski, Guillaume, Yvon, François, Dumitrescu, Stefan Daniel, Boroş, Tiberiu, Tufiş, Dan, Das, Ayan, Zaffar, Affan, Sarkar, Sudeshna, Wang, Hao, Zhao, Hai, Zhang, Zhisong, Hornby, Ryan, Taylor, Clark, Park, Jungyeul, de Lhoneux, Miryam, Shao, Yan, Basirat, Ali, Kiperwasser, Eliyahu, Stymne, Sara, Goldberg, Yoav, Nivre, Joakim, Akkuş, Burak Kerim, Azizoglu, Heval, Cakici, Ruket, Moor, Christophe, Merlo, Paola, Henderson, James, Wang, Haozhou, Ji, Tao, Wu, Yuanbin, Lan, Man, de la Clergerie, Eric, Sagot, Benoît, Seddah, Djamé, More, Amir, Tsarfaty, Reut, Kanayama, Hiroshi, Muraoka, Masayasu, Yoshikawa, Katsumasa, Garcia, Marcos, and Gamallo, Pablo
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- dependency parser and parsebank
- Language:
- Arabic, Bulgarian, Russia Buriat, Czech, Catalan, Church Slavic, Danish, German, Modern Greek (1453-), English, Spanish, Estonian, Basque, Persian, Finnish, French, Irish, Galician, Gothic, Ancient Greek (to 1453), Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Latin, Latvian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Northern Sami, Swedish, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, and Chinese
- Description:
- This package contains the system outputs from the CoNLL 2017 Shared Task in Multilingual Parsing from Raw Text to Universal Dependencies.
- Rights:
- Licence Universal Dependencies v2.0, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0, and PUB
223. CoNLL 2018 Shared Task - UDPipe Baseline Models and Supplementary Materials
- Creator:
- Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, mlmodel, and languageDescription
- Subject:
- CoNLL 2018, tokenizer, POS tagger, lemmatization, tagger, parser, dependency parser, morphology, and treebank
- Language:
- Multiple languages
- Description:
- Baseline UDPipe models for CoNLL 2018 Shared Task in UD Parsing, and supplementary material. The models require UDPipe version 1.2 or later and are evaluated using the official evaluation script. The models were trained using a custom data split for treebanks where no development data is provided. Also, we trained an additional "Mixed" model, which uses 200 sentences from each training data set. All information needed to replicate the model training (hyperparameters, modified train-dev split, and pre-computed word embeddings for the parser) is included in the archive. Additionally, we provide UD 2.2 CoNLL 2018 training data with automatically predicted morphology. We utilize the baseline models on development data and perform 10-fold jack-knifing (each fold is predicted with a model trained on the rest of the folds) on the training data.
- Rights:
- Licence Universal Dependencies v2.2, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2, and PUB
224. CoNLL 2018 Shared Task System Outputs
- Creator:
- Zeman, Daniel, Potthast, Martin, Duthoo, Elie, Mesnard, Olivier, Rybak, Piotr, Wróblewska, Alina, Che, Wanxiang, Liu, Yijia, Wang, Yuxuan, Zheng, Bo, Liu, Ting, Li, Zuchao, He, Shexia, Zhang, Zhuosheng, Zhao, Hai, Wu, Yingting, Tong, Jia-Jun, Nguyen, Dat Quoc, Verspoor, Karin, Wan, Hui, Naseem, Tahira, Lee, Young-Suk, Castelli, Vittorio, Ballesteros, Miguel, Hershcovich, Daniel, Abend, Omri, Rappoport, Ari, Smith, Aaron, Bohnet, Bernd, de Lhoneux, Miryam, Nivre, Joakim, Shao, Yan, Stymne, Sara, Kırnap, Ömer, Dayanık, Erenay, Yuret, Deniz, Kanerva, Jenna, Ginter, Filip, Miekka, Niko, Leino, Akseli, Salakoski, Tapio, Lim, KyungTae, Park, Cheoneum, Lee, Changki, Poibeau, Thierry, Bhat, Riyaz Ahmad, Bhat, Irshad, Bangalore, Srinivas, Qi, Peng, Dozat, Timothy, Zhang, Yuhao, Manning, Christopher, Boroș, Tiberiu, Dumitrescu, Stefan Daniel, Burtica, Ruxandra, Arakelyan, Gor, Hambardzumyan, Karen, Khachatrian, Hrant, Rosa, Rudolf, Mareček, David, Straka, Milan, Seker, Amit, More, Amir, Tsarfaty, Reut, Önder, Berkay Furkan, Gümeli, Can, Jawahar, Ganesh, Muller, Benjamin, Fethi, Amal, Martin, Louis, Villemonte de la Clergerie, Eric, Sagot, Benoît, Seddah, Djamé, Özateş, Şaziye Betül, Özgür, Arzucan, Gungor, Tunga, Öztürk, Balkız, Ji, Tao, Liu, Yufang, Wang, Yijun, Wu, Yuanbin, Lan, Man, Chen, Danlu, Lin, Mengxiao, Hu, Zhifeng, and Qiu, Xipeng
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parsed data, conllu, and universal dependencies
- Language:
- Afrikaans, Arabic, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Persian, Finnish, French, Old French (842-ca. 1400), Irish, Galician, Gothic, Ancient Greek (to 1453), Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Latin, Latvian, Dutch, Norwegian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Thai, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, and Chinese
- Description:
- Test data parsed by systems submitted to the CoNLL 2018 UD parsing shared task.
- Rights:
- Licence Universal Dependencies v2.2, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2, and PUB
225. CoNLL-based Extended Czech Named Entity Corpus 1.0
- Creator:
- Konkol, Michal, Konopík, Miloslav, Ševčíková, Magda, Žabokrtský, Zdeněk, and Straková, Jana
- Publisher:
- University of West Bohemia
- Type:
- text and corpus
- Subject:
- named entity recognition, Czech, and conll
- Language:
- Czech
- Description:
- This is the Czech Named Entity Corpus 1.0 transformed into the CoNLL format. The original corpus can be downloaded from: http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C. The CoNLL transformation is described in this publication: https://link.springer.com/chapter/10.1007/978-3-642-40585-3_20.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
226. CoNLL-based Extended Czech Named Entity Corpus 2.0
- Creator:
- Konkol, Michal, Konopík, Miloslav, Ševčíková, Magda, Žabokrtský, Zdeněk, Straková, Jana, and Straka, Milan
- Publisher:
- University of West Bohemia
- Type:
- text and corpus
- Subject:
- named entity recognition and Czech
- Language:
- Czech
- Description:
- This is the Czech Named Entity Corpus 2.0 transformed into the CoNLL format. The original corpus can be downloaded from: http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8. The CoNLL transformation is described in this publication: https://link.springer.com/chapter/10.1007/978-3-642-40585-3_20.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
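The two records above distribute the Czech Named Entity Corpus in CoNLL format. A minimal reading sketch follows, assuming a CoNLL-2003-style layout: one token per line, whitespace-separated columns with the token first and the entity label last, and a blank line between sentences. That column layout, the function name `read_conll_sentences`, and the sample tags are assumptions for illustration, not taken from the corpus documentation; check the files themselves before relying on them.

```python
def read_conll_sentences(lines):
    """Yield sentences as lists of (token, label) pairs.

    Assumes a CoNLL-2003-style layout (an assumption, not confirmed by the
    corpus description): token in the first column, label in the last,
    blank line between sentences.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    # Flush the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented sample; the tag names are hypothetical.
sample = [
    "Jan B-PER",
    "Hus I-PER",
    "kázal O",
    "",
    "Praha B-LOC",
]

sentences = list(read_conll_sentences(sample))
print(len(sentences))
```

With a real file, the same generator can be fed an open file handle instead of the `sample` list.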
227. Contemporary Arabic dictionary
- Creator:
- Namly, Driss
- Publisher:
- Ibtikarat Team
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- lexical semantics
- Language:
- Arabic
- Description:
- An XML-based file containing the electronic version of the al logha al arabia al moassira (Contemporary Arabic) dictionary, an Arabic monolingual dictionary compiled by Ahmed Mukhtar Abdul Hamid Omar (deceased: 1424) with the help of a working group.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
228. Continuous Rating; Supplementary materials
- Creator:
- Javorský, Dávid, Macháček, Dominik, and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- manual evaluation, simultaneous speech subtitling, Continuous Rating, and questionnaire evaluation
- Language:
- German and Czech
- Description:
- Collected data from Continuous Rating evaluation study; collected Continuous Rating scores and Questionnaires.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
229. Coreference in Universal Dependencies 0.1 (CorefUD 0.1)
- Creator:
- Nedoluzhko, Anna, Novák, Michal, Popel, Martin, Žabokrtský, Zdeněk, and Zeman, Daniel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- dependency, treebank, coreference, bridging relations, and harmonized annotation
- Language:
- Catalan, Czech, Dutch, English, French, German, Hungarian, Lithuanian, Polish, Russian, and Spanish
- Description:
- CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 0.1 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too.
  References to original resources whose harmonized versions are contained in the public edition of CorefUD 0.1:
  - Catalan-AnCora: Recasens, M. and Martí, M. A. (2010). AnCora-CO: Coreferentially Annotated Corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.
  - Czech-PCEDT: Nedoluzhko, A., Novák, M., Cinková, S., Mikulová, M., and Mírovský, J. (2016). Coreference in Prague Czech-English Dependency Treebank. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 169–176, Portorož, Slovenia. European Language Resources Association.
  - Czech-PDT: Hajič, J., Bejček, E., Hlaváčová, J., Mikulová, M., Straka, M., Štěpánek, J., and Štěpánková, B. (2020). Prague Dependency Treebank - Consolidated 1.0. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pages 5208–5218, Marseille, France. European Language Resources Association.
  - English-GUM: Zeldes, A. (2017). The GUM Corpus: Creating Multilayer Resources in the Classroom. Language Resources and Evaluation, 51(3):581–612.
  - English-ParCorFull: Lapshinova-Koltunski, E., Hardmeier, C., and Krielke, P. (2018). ParCorFull: a Parallel Corpus Annotated with Full Coreference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
  - French-Democrat: Landragin, F. (2016). Description, modélisation et détection automatique des chaînes de référence (DEMOCRAT). Bulletin de l’Association Française pour l’Intelligence Artificielle, (92):11–15.
  - German-ParCorFull: Lapshinova-Koltunski, E., Hardmeier, C., and Krielke, P. (2018). ParCorFull: a Parallel Corpus Annotated with Full Coreference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
  - German-PotsdamCC: Bourgonje, P. and Stede, M. (2020). The Potsdam Commentary Corpus 2.2: Extending annotations for shallow discourse parsing. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1061–1066, Marseille, France. European Language Resources Association.
  - Hungarian-SzegedKoref: Vincze, V., Hegedűs, K., Sliz-Nagy, A., and Farkas, R. (2018). SzegedKoref: A Hungarian Coreference Corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
  - Lithuanian-LCC: Žitkus, V. and Butkienė, R. (2018). Coreference Annotation Scheme and Corpus for Lithuanian Language. In Fifth International Conference on Social Networks Analysis, Management and Security, SNAMS 2018, Valencia, Spain, October 15-18, 2018, pages 243–250. IEEE.
  - Polish-PCC: Ogrodniczuk, M., Glowińska, K., Kopeć, M., Savary, A., and Zawisławska, M. (2013). Polish coreference corpus. In Human Language Technology. Challenges for Computer Science and Linguistics - 6th Language and Technology Conference, LTC 2013, Poznań, Poland, December 7-9, 2013. Revised Selected Papers, volume 9561 of Lecture Notes in Computer Science, pages 215–226. Springer.
  - Russian-RuCor: Toldova, S., Roytberg, A., Ladygina, A. A., Vasilyeva, M. D., Azerkovich, I. L., Kurzukov, M., Sim, G., Gorshkov, D. V., Ivanova, A., Nedoluzhko, A., and Grishina, Y. (2014). Evaluating Anaphora and Coreference Resolution for Russian. In Komp’juternaja lingvistika i intellektual’nye tehnologii. Po materialam ezhegodnoj Mezhdunarodnoj konferencii Dialog, pages 681–695.
  - Spanish-AnCora: Recasens, M. and Martí, M. A. (2010). AnCora-CO: Coreferentially Annotated Corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.
  References to original resources whose harmonized versions are contained in the ÚFAL-internal edition of CorefUD 0.1:
  - Dutch-COREA: Hendrickx, I., Bouma, G., Coppens, F., Daelemans, W., Hoste, V., Kloosterman, G., Mineur, A.-M., Van Der Vloet, J., and Verschelde, J.-L. (2008). A coreference corpus and resolution system for Dutch. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association.
  - English-ARRAU: Uryupina, O., Artstein, R., Bristot, A., Cavicchio, F., Delogu, F., Rodriguez, K. J., and Poesio, M. (2020). Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus. Natural Language Engineering, 26(1):95–128.
  - English-OntoNotes: Weischedel, R., Hovy, E., Marcus, M., Palmer, M., Belvin, R., Pradhan, S., Ramshaw, L., and Xue, N. (2011). Ontonotes: A large training corpus for enhanced processing. In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, pages 54–63, New York. Springer-Verlag.
  - English-PCEDT: Nedoluzhko, A., Novák, M., Cinková, S., Mikulová, M., and Mírovský, J. (2016). Coreference in Prague Czech-English Dependency Treebank. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 169–176, Portorož, Slovenia. European Language Resources Association.
- Rights:
- Licence CorefUD v0.1, https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-0.1, and PUB
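The CorefUD descriptions state that coreference and bridging information is stored as attribute-value pairs in the MISC column of CoNLL-U files. As a minimal sketch of how such pairs can be read, the code below relies only on the general CoNLL-U conventions (ten tab-separated columns, MISC as the tenth, pairs separated by `|`, `_` for an empty column). The function names and the attribute `ExampleAttr` are hypothetical; the actual attribute names differ between CorefUD versions and are documented with each release.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC value into a dict of attribute-value pairs."""
    if not misc or misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def read_conllu_tokens(lines):
    """Yield (ID, FORM, MISC dict) for every token line of a CoNLL-U stream."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token line
        yield columns[0], columns[1], parse_misc(columns[9])


# Invented sample line; "ExampleAttr" is a placeholder, not a CorefUD attribute.
sample = [
    "# text = Hello",
    "\t".join(["1", "Hello", "hello", "INTJ", "_", "_", "0", "root", "_",
               "SpaceAfter=No|ExampleAttr=x1"]),
    "",
]

for token_id, form, misc in read_conllu_tokens(sample):
    print(token_id, form, misc)
```

Keeping the MISC parser separate from the line reader makes it easy to swap in version-specific handling of individual attributes later.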
230. Coreference in Universal Dependencies 0.2 (CorefUD 0.2)
- Creator:
- Nedoluzhko, Anna, Novák, Michal, Popel, Martin, Žabokrtský, Zdeněk, and Zeman, Daniel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- dependency, treebank, coreference, bridging relations, and harmonized annotation
- Language:
- Catalan, Czech, Dutch, English, French, German, Hungarian, Lithuanian, Polish, Russian, and Spanish
- Description:
- CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 0.2 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Version 0.2 consists of exactly the same datasets as version 0.1. All automatically parsed datasets were re-parsed for v0.2 using UDPipe 2 with models trained on UD 2.6. Catalan-AnCora, Spanish-AnCora and English-GUM have been updated to match their UD 2.9 versions.
- Rights:
- Licence CorefUD v0.2, https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-0.2, and PUB
231. Coreference in Universal Dependencies 1.0 (CorefUD 1.0)
- Creator:
- Nedoluzhko, Anna, Novák, Michal, Popel, Martin, Žabokrtský, Zdeněk, Zeldes, Amir, Zeman, Daniel, Bourgonje, Peter, Cinková, Silvie, Hajič, Jan, Hardmeier, Christian, Krielke, Pauline, Landragin, Frédéric, Lapshinova-Koltunski, Ekaterina, Martí, M. Antònia, Mikulová, Marie, Ogrodniczuk, Maciej, Recasens, Marta, Stede, Manfred, Straka, Milan, Toldova, Svetlana, Vincze, Veronika, and Žitkus, Voldemaras
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- dependency, treebank, coreference, bridging relations, and harmonized annotation
- Language:
- Catalan, Czech, Dutch, English, French, German, Hungarian, Lithuanian, Polish, Russian, and Spanish
- Description:
- CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.0 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Version 1.0 consists of the same corpora and languages as the previous version 0.2; however, the English GUM dataset has been updated to a newer and larger version, and in the Czech/English PCEDT dataset, the train-dev-test split has been changed to be compatible with OntoNotes. Nevertheless, the main change is in the file format (the MISC attributes have new form and interpretation).
- Rights:
- Licence CorefUD v0.2, https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-0.2, and PUB
232. Coreference in Universal Dependencies 1.1 (CorefUD 1.1)
- Creator:
- Novák, Michal, Popel, Martin, Žabokrtský, Zdeněk, Zeman, Daniel, Nedoluzhko, Anna, Acar, Kutay, Bourgonje, Peter, Cinková, Silvie, Cebiroğlu Eryiğit, Gülşen, Hajič, Jan, Hardmeier, Christian, Haug, Dag, Jørgensen, Tollef, Kåsen, Andre, Krielke, Pauline, Landragin, Frédéric, Lapshinova-Koltunski, Ekaterina, Mæhlum, Petter, Martí, M. Antònia, Mikulová, Marie, Nøklestad, Anders, Ogrodniczuk, Maciej, Øvrelid, Lilja, Pamay Arslan, Tuğba, Recasens, Marta, Solberg, Per Erik, Stede, Manfred, Straka, Milan, Toldova, Svetlana, Vadász, Noémi, Velldal, Erik, Vincze, Veronika, Zeldes, Amir, and Žitkus, Voldemaras
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- dependency, treebank, coreference, bridging relations, and harmonized annotation
- Language:
- Catalan, Czech, English, French, German, Hungarian, Lithuanian, Norwegian, Polish, Russian, Spanish, and Turkish
- Description:
- CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.1 consists of 21 datasets for 13 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 17 datasets for 12 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Compared to the previous version 1.0, the version 1.1 comprises new languages and corpora, namely Hungarian-KorKor, Norwegian-BokmaalNARC, Norwegian-NynorskNARC, and Turkish-ITCC. In addition, the English GUM dataset has been updated to a newer and larger version, and the conversion pipelines for most datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file).
- Rights:
- Licence CorefUD v1.1, https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-1.1, and PUB
233. Coreference in Universal Dependencies 1.2 (CorefUD 1.2)
- Creator:
- Popel, Martin, Novák, Michal, Žabokrtský, Zdeněk, Zeman, Daniel, Nedoluzhko, Anna, Acar, Kutay, Bamman, David, Bourgonje, Peter, Cinková, Silvie, Eckhoff, Hanne, Cebiroğlu Eryiğit, Gülşen, Hajič, Jan, Hardmeier, Christian, Haug, Dag, Jørgensen, Tollef, Kåsen, Andre, Krielke, Pauline, Landragin, Frédéric, Lapshinova-Koltunski, Ekaterina, Mæhlum, Petter, Martí, M. Antònia, Mikulová, Marie, Nøklestad, Anders, Ogrodniczuk, Maciej, Øvrelid, Lilja, Pamay Arslan, Tuğba, Recasens, Marta, Solberg, Per Erik, Stede, Manfred, Straka, Milan, Swanson, Daniel, Toldova, Svetlana, Vadász, Noémi, Velldal, Erik, Vincze, Veronika, Zeldes, Amir, and Žitkus, Voldemaras
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- coreference, bridging relations, harmonized annotation, dependency, and treebank
- Language:
- Ancient Greek (to 1453), Ancient Hebrew, Catalan, Czech, English, French, German, Hungarian, Lithuanian, Norwegian, Church Slavic, Polish, Russian, Spanish, and Turkish
- Description:
- CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.2 consists of 25 datasets for 16 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 21 datasets for 15 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource, too. Compared to the previous version 1.1, the version 1.2 comprises new languages and corpora, namely Ancient_Greek-PROIEL, Ancient_Hebrew-PTNK, English-LitBank, and Old_Church_Slavonic-PROIEL. In addition, English-GUM and Turkish-ITCC have been updated to newer versions, conversion of zeros in Polish-PCC has been improved, and the conversion pipelines for multiple other datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file).
- Rights:
- Licence CorefUD v1.2, https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-1.2, and PUB
234. CORMAP - Corpus for Moroccan Arabic Processing
- Creator:
- tachicart, ridouane and bouzoubaa, karim
- Publisher:
- ALELM
- Type:
- text and corpus
- Subject:
- corpus
- Language:
- Arabic and Moroccan Arabic
- Description:
- This resource is a corpus containing 34k Moroccan Colloquial Arabic sentences collected from different sources. The sentences are written in Arabic letters. This resource can be useful in some NLP applications such as Language Identification.
- Rights:
- Licence Universal Dependencies v2.1, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.1, and PUB
235. CorPipe 23 multilingual CorefUD 1.1 model (corpipe23-corefud1.1-231206)
- Creator:
- Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- coreference resolution, CorPipe, and CorefUD
- Language:
- Catalan, Czech, German, English, Spanish, French, Hungarian, Lithuanian, Norwegian Bokmål, Norwegian Nynorsk, Polish, Russian, and Turkish
- Description:
- The `corpipe23-corefud1.1-231206` is a `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released under the CC BY-NC-SA 4.0 license. The model is language agnostic (no _corpus id_ on input), so it can be used to predict coreference in any `mT5` language (for zero-shot evaluation, see the paper). However, note that empty nodes must already be present on input; they are not predicted (the same setting as in the CRAC23 shared task).
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
236. Corpus for training and evaluating diacritics restoration systems
- Creator:
- Náplava, Jakub, Straka, Milan, Hajič, Jan, and Straňák, Pavel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- diacritical marks generation and natural language correction
- Language:
- Czech, Vietnamese, Romanian, Polish, Slovak, Spanish, Croatian, Irish, Latvian, Hungarian, French, and Turkish
- Description:
- Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. Moreover, each language dataset contains a (substantially larger) training set collected from (general) Web texts. All sets, except for the Wikipedia and Web training sets, which can contain similar sentences, are disjoint. Data are segmented into sentences, which are further word tokenized. All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained within the attached stripping_diacritics.zip. This script has two modes. We generally recommend using the method called uninames, which behaves better for some languages. The code for training a recurrent neural-network based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
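The record above mentions a stripping script with two modes and recommends the one called uninames. The sketch below is not the contents of diacritization_stripping.py; it is an illustration, using only Python's standard `unicodedata` module, of two common ways to strip diacritics and why a character-name approach can behave better for some languages. The function names are hypothetical, and whether the released script works this way is an assumption.

```python
import unicodedata


def strip_by_unicode_names(text):
    """Replace each character by the base letter named before ' WITH '.

    For example, 'LATIN SMALL LETTER D WITH STROKE' becomes
    'LATIN SMALL LETTER D'. Characters without ' WITH ' in their
    Unicode name are kept unchanged.
    """
    result = []
    for char in text:
        name = unicodedata.name(char, "")
        base_name, separator, _ = name.partition(" WITH ")
        if separator:
            try:
                char = unicodedata.lookup(base_name)
            except KeyError:
                pass  # no character with the shortened name; keep original
        result.append(char)
    return "".join(result)


def strip_by_normalization(text):
    """Decompose to NFD and drop combining marks.

    Letters without a canonical decomposition (such as the Vietnamese
    d with stroke) pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Vietnamese word written with escapes: d-with-stroke, u-with-horn,
# o-with-horn-and-grave, then "ng".
word = "\u0111\u01b0\u1eddng"
print(strip_by_unicode_names(word))
print(strip_by_normalization(word))
```

The name-based variant also maps letters such as the stroked d to a plain letter, while the normalization variant leaves them in place because they have no canonical decomposition.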
237. Corpus of contemporary blogs
- Creator:
- Grác, Marek
- Publisher:
- Masaryk University, NLP Centre
- Type:
- text and corpus
- Subject:
- corpus, blogs, annotation, annotators, sentences, and machine learning
- Language:
- Czech
- Description:
- In the NLP Centre, dividing text into sentences is currently done with a tool that uses a rule-based system. In order to create enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, and myego.cz/welldone.
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
238. Corpus of precisely articulated Czech speech
- Creator:
- Hanzlíček, Zdeněk, Kochová, Pavla, Tihelka, Daniel, Kövérová, Markéta, Matoušek, Jindřich, and Ševeček, Pavel
- Publisher:
- University of West Bohemia, Department of Cybernetics and Lingea, s.r.o.
- Type:
- audio and corpus
- Subject:
- speech corpus, text-to-speech (TTS), speech synthesis, and hyperarticulated speech
- Language:
- Czech
- Description:
- The corpus contains speech data of 2 Czech native speakers, male and female. The speech is very precisely articulated up to hyper-articulated, and the speech rate is low. The speech data with a highlighted articulation is suitable for teaching foreigners the Czech language, and it can also be used for people with hearing or speech impairment. The recorded sentences can be used either directly, e.g., as a part of educational material, or as source data for building complex educational systems incorporating speech synthesis technology. All recorded sentences were precisely orthographically annotated and phonetically segmented, i.e., split into phones, using modern neural network-based methods.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
239. Corpus OVER
- Creator:
- Col, Gilles
- Publisher:
- Université de Poitiers
- Type:
- text and corpus
- Subject:
- over, semantics, instruction, and corpus-data
- Language:
- English
- Description:
- Many studies in cognitive linguistics have analysed the semantics of 'over', notably the semantics associated with 'over' as a preposition. Most of them conclude that 'over' is polysemic and that this polysemy is to be described by means of a semantic radial network showing the relationships between the different meanings of the word. What we would like to suggest, on the contrary, is that the meanings of 'over' are highly dependent on the utterance context in which its occurrences are embedded, and consequently that the meaning of 'over' itself is under-specified rather than polysemic. Moreover, to provide a more accurate account of the apparent wide range of meanings of 'over' in context, we ought to take into account the other uses of this unit: as an adverb and particle, and not only as a preposition. In this paper, we provide a corpus-based description of 'over' which leads us to propose a monosemic definition. To achieve such a description, we used a short dataset of 326 randomly selected sentences containing 'over' in various positions in the sentences and corresponding to various categories.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
240. COSTRA 1.0: A Dataset of Complex Sentence Transformations
- Creator:
- Barančíková, Petra and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- sentences, sentence embeddings, paraphrases, and semantic relations
- Language:
- Czech
- Description:
- COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation. The hope is that with this dataset, we should be able to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting “skeleton” in the sentence embedding space.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
241. COSTRA 1.1: A Dataset of Complex Sentence Transformations and Comparisons
- Creator:
- Barančíková, Petra and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- paraphrases, sentence embeddings, evaluation, and sentence
- Language:
- Czech
- Description:
- Costra 1.1 is a new dataset for testing geometric properties of sentence embedding spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense or generalization. The dataset is a direct expansion of Costra 1.0, which was extended with more sentences and sentence comparisons.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
242. Course for Ice Sports Instructors
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- hokej výuka, rychlobruslení výuka, kurzy ledních sportů, akce Kuratorium pro výchovu mládeže, Kuratorium pro výchovu mládeže, bruslení výuka, výuka ledních sportů, stadion zimní, trénink hokejový, píšťalka, Kuratorium, Places::Praha::Holešovice::Štvanice::zimní stadion, People::Teuner František (1911-1978), People::Maleček Josef (1903-1982), People::Zábrodský Vladimír (1923-2020), People::Tožička Jiří (1901-1981), and Český zvukový týdeník Aktualita::1943/9B
- Language:
- Czech
- Description:
- Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 9B from 1943 captures an ice sports course for mandatory youth service instructors, which was organised by the Board of Trustees for the Education of Youth as part of the Ice Sports Week event held at Štvanice Ice Arena in Prague from 1 to 6 February. Training in speed skating and ice hockey was led by hockey players Josef Maleček, Vladimír Zábrodský and Jiří Tožička. The event was attended by General Secretary of the Board František Teuner.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
243. Covid-19 Thesaurus
- Creator:
- Fener, Patricia
- Publisher:
- Institute for scientific and technical information (Inist) - CNRS/UAR76
- Type:
- thesaurus, text, and lexicalConceptualResource
- Subject:
- COVID-19, SARS coronavirus, Middle-East coronavirus, SARS-CoV, and MERS-CoV
- Language:
- French and English
- Description:
- This bilingual thesaurus (French-English), developed at Inist-CNRS, covers the concepts from the emerging COVID-19 outbreak, which recalls the past SARS coronavirus outbreak and Middle East coronavirus outbreak. This thesaurus is based on the vocabulary used in scientific publications for SARS-CoV-2 and other coronaviruses, like SARS-CoV and MERS-CoV. It provides support for exploring the coronavirus infectious diseases. The thesaurus can be browsed and queried by humans and machines on the Loterre portal (https://www.loterre.fr), via an API and an RDF triplestore. It is also downloadable in PDF, SKOS, CSV and JSON-LD formats. The thesaurus is made available under a CC-by 4.0 license.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), PUB, and http://creativecommons.org/licenses/by/4.0/
244. CsEnVi Pairwise Parallel Corpora
- Creator:
- Hoang, Duc Tam and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, Vietnamese, parallel corpus, Czech-Vietnamese corpus, and English-Vietnamese corpus
- Language:
- Czech, English, and Vietnamese
- Description:
- CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
  - OPUS, the open parallel corpus, is a growing multilingual corpus of translated open source documents. The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series. By nature, the bitexts are paraphrases of each other's meaning rather than translations.
  - TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.
  The size of the original corpora collected from OPUS and TED talks is as follows:
                 CS/VI               EN/VI
    Sentence     1337199/1337199     2035624/2035624
    Word         9128897/12073975    16638364/17565580
    Unique word  224416/68237        91905/78333
  We improve the quality of the corpora in two steps: normalizing and filtering. In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. The sequences of dots are distributed differently in the source and the target side. Removing the sequences of dots, along with a number of other normalization rules, improves the quality of the alignment significantly. In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
  The size of the cleaned corpora as published is as follows:
                 CS/VI               EN/VI
    Sentence     1091058/1091058     1113177/1091058
    Word         6718184/7646701     8518711/8140876
    Unique word  195446/59737        69513/58286
  The corpora are used as training data in [2].
  References:
  [1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
  [2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
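The normalizing step described for the CsEnVi corpora (entry 244) mentions removing sequences of dots from both sides of a sentence pair. The following is only a minimal sketch of that single rule, not the authors' actual tool: the function names, the choice to treat two or more dots as a "sequence", and the whitespace handling are all assumptions made for illustration.

```python
import re
from typing import Tuple

# Assumption: a "sequence of dots" is two or more consecutive periods,
# or the single Unicode ellipsis character.
DOT_RUN = re.compile(r"\.{2,}|\u2026")


def strip_dot_runs(line: str) -> str:
    """Remove runs of dots and collapse the whitespace they leave behind."""
    without_dots = DOT_RUN.sub(" ", line)
    return " ".join(without_dots.split())


def normalize_pair(source: str, target: str) -> Tuple[str, str]:
    """Apply the same rule to both sides of a (hypothetical) sentence pair."""
    return strip_dot_runs(source), strip_dot_runs(target)
```

A single sentence-final period is left untouched, so only continuation markers are dropped. The other normalization rules mentioned in the description are not specified there and are not sketched here.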
245. Ctibor Blatný (phytopathologist)
- Creator:
- Veselý, Bohumil
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, Places::Praha::Dejvice::Hanspaulka::vila Oldřicha Blažíčka /ext.,int./, and People::Blatný Ctibor (1897-1978)
- Language:
- No linguistic content
- Description:
- Professor and phytopathologist Ctibor Blatný on Bohumil Veselý's balcony.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
246. CUBBITT Translation Models (en-cs) (v1.0)
- Creator:
- Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- machine translation, neural machine translation, transformer, and cubbitt
- Language:
- English and Czech
- Description:
- CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2014 (BLEU): en->cs: 27.6 cs->en: 34.4 (Evaluated using multeval: https://github.com/jhclark/multeval)
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
247. CUBBITT Translation Models (en-fr) (v1.0)
- Creator:
- Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- machine translation, neural machine translation, transformer, and cubbitt
- Language:
- English and French
- Description:
- CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2014 (BLEU): en->fr: 38.2 fr->en: 36.7 (Evaluated using multeval: https://github.com/jhclark/multeval)
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
248. CUBBITT Translation Models (en-pl) (v1.0)
- Creator:
- Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- machine translation, neural machine translation, transformer, and cubbitt
- Language:
- English and Polish
- Description:
- CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->pl: 12.3 pl->en: 20.0 (Evaluated using multeval: https://github.com/jhclark/multeval)
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
249. CWC2011
- Creator:
- Spoustová, Johanka and Spousta, Miroslav
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, Czech, and web
- Language:
- Czech
- Description:
- Web corpus of Czech, created in 2011. It contains newspapers and magazines, discussions, and blogs. See http://www.lrec-conf.org/proceedings/lrec2012/summaries/120.html for details. Project: GA405/09/0278
- Rights:
- Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB
250. Cyril Bouda (painter, illustrator)
- Creator:
- Veselý, Bohumil
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Bouda Cyril (1901-1984)
- Language:
- No linguistic content
- Description:
- Painter and illustrator Cyril Bouda on Bohumil Veselý's balcony.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
251. Cyril Merhout (historian)
- Creator:
- Veselý, Bohumil
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, and People::Merhout Cyril (1881-1955)
- Language:
- No linguistic content
- Description:
- Historian Cyril Merhout on Bohumil Veselý's balcony.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
252. Czech and English abstracts of ÚFAL papers
- Creator:
- Rosa, Rudolf
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parallel corpus, scientific texts, and abstracts
- Language:
- Czech and English
- Description:
- This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague, as reported in the institute's system Biblio. For each publication, the authors are obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. No filtering was performed, except for removing entries missing the Czech or English abstract, and replacing newline and tab characters with spaces.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
253. Czech and English abstracts of ÚFAL papers (2022-11-11)
- Creator:
- Rosa, Rudolf and Zouhar, Vilém
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parallel corpus, scientific texts, and abstracts
- Language:
- Czech and English
- Description:
- This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague. For each publication record, the authors are obliged to provide both the original abstract (in Czech or English), and its translation (English or Czech) in the internal Biblio system. The data was filtered for duplicates and missing entries, ensuring that every record is bilingual. Additionally, records of published papers which are indexed by SemanticScholar contain the respective link. The dataset was created from a September 2022 image of the Biblio database and is stored in JSONL format, with each line corresponding to one record.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
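Since the 2022-11-11 release of the abstracts corpus (entry 253) is described as JSONL with one record per line, a reader along the following lines may be a convenient starting point. This is a sketch under assumptions: the function name is made up, and the field names inside each record are not given in the description, so the code treats every record as an opaque dictionary.

```python
import io
import json
from typing import Iterator, TextIO


def read_records(stream: TextIO) -> Iterator[dict]:
    """Yield one dictionary per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield record


# Made-up sample; the keys "cs" and "en" are hypothetical, not the real schema.
sample = io.StringIO('{"cs": "Ahoj", "en": "Hello"}\n\n{"cs": "Svět", "en": "World"}\n')
records = list(read_records(sample))
```

With a real file, `open(path, encoding="utf-8")` can be passed in place of the `io.StringIO` sample.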
254. Czech and English Reflective Dataset (CEReD)
- Creator:
- Štefánik, Michal and Nehyba, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reflective writing, reflective categories, pre-service teachers, and hand annotation
- Language:
- English and Czech
- Description:
- The database contains annotated reflective sentences, which fall into the categories of reflective writing according to Ullmann's (2019) model. The dataset is ready to be used for replicating the prediction of these categories with machine learning. Available from: https://anonymous.4open.science/repository/c856595c-dfc2-48d7-aa3d-0ccc2648c4dc/data
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
255. Czech Court Decisions Corpus (CzCDC 1.0)
- Creator:
- Novotná, Tereza and Harašta, Jakub
- Publisher:
- Faculty of Law, Masaryk University
- Type:
- text and corpus
- Subject:
- legal texts, judicial decisions, and court decisions
- Language:
- Czech
- Description:
- This is the Czech Court Decisions Corpus (CzCDC 1.0). This corpus contains whole texts of the decisions from three top-tier courts (Supreme, Supreme Administrative and Constitutional Court) in the Czech Republic. Court decisions are published from 1st January 1993 to 30th September 2018. The language of decisions is Czech. Content of decisions is unedited and obtained directly from the competent court. Decisions are in .txt format in three folders divided by courts. The corpus contains three .csv files containing the list of all decisions with four columns:
  - name of the file: exact file name of a decision with extension .txt;
  - decision identifier (docket number): official identification of the decision as issued by the court;
  - date of decision: in ISO 8601 (YYYY-MM-DD);
  - court abbreviation: SupCo for Supreme Court, SupAdmCo for Supreme Administrative Court, ConCo for Constitutional Court.
  Statistics:
  - SupCo: 111 977 decisions, 23 699 639 lines, 224 061 129 words, 1 462 948 200 bits;
  - SupAdmCo: 52 660 decisions, 18 069 993 lines, 137 839 985 words, 1 067 826 507 bits;
  - ConCo: 73 086 decisions, 6 178 371 lines, 98 623 753 words, 664 657 755 bits;
  - all courts combined: 237 723 decisions, 47 948 003 lines, 460 524 867 words, 3 195 432 462 bits.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
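The four-column index files described for CzCDC 1.0 (entry 255) could be loaded roughly as follows. This is a hedged sketch: the column order follows the description, but the delimiter, the presence of a header row, and every name in the code (`Decision`, `read_decisions`) are assumptions, and the sample row is invented.

```python
import csv
import io
from datetime import date
from typing import Iterator, NamedTuple, TextIO


class Decision(NamedTuple):
    file_name: str       # exact .txt file name of the decision
    docket_number: str   # official identifier issued by the court
    decided_on: date     # parsed from ISO 8601 (YYYY-MM-DD)
    court: str           # SupCo, SupAdmCo or ConCo


def read_decisions(stream: TextIO, delimiter: str = ",") -> Iterator[Decision]:
    """Yield one Decision per well-formed row; the delimiter is an assumption."""
    for row in csv.reader(stream, delimiter=delimiter):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        file_name, docket_number, raw_date, court = (cell.strip() for cell in row)
        try:
            decided_on = date.fromisoformat(raw_date)
        except ValueError:
            continue  # e.g. a header row, if the file has one
        yield Decision(file_name, docket_number, decided_on, court)


# Invented sample: a possible header line plus one made-up decision.
sample = io.StringIO(
    "file,docket,date,court\n"
    "example.txt,1 Abc 123/2018,2018-09-30,SupCo\n"
)
decisions = list(read_decisions(sample))
```

Parsing the date eagerly makes it easy to filter by the 1993 to 2018 range stated in the description; rows whose date does not parse are skipped rather than raising, which is a design choice worth revisiting for real data.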
256. Czech Court Decisions Dataset
- Creator:
- Kríž, Vincent and Hladká, Barbora
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- named entities, annotation, and corpus
- Language:
- Czech
- Description:
- We present the Czech Court Decisions Dataset (CCDD) -- a dataset of 300 manually annotated court decisions published by The Supreme Court of the Czech Republic and the Constitutional Court of the Czech Republic.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
257. Czech Grammar Agreement Dataset for Evaluation of Language Models
- Creator:
- Baisa, Vít
- Publisher:
- Masaryk University, NLP Centre
- Type:
- text and corpus
- Subject:
- agreement, past tense verb suffix, language model, and training data
- Language:
- Czech
- Description:
- AGREE is a dataset and task for evaluation of language models based on grammar agreement in Czech. The dataset consists of sentences with marked suffixes of past tense verbs. The task is to choose the right verb suffix, which depends on the gender, number and animacy of the subject. It is challenging for language models because 1) Czech is morphologically rich, 2) it has relatively free word order, 3) it has a high out-of-vocabulary (OOV) ratio, 4) predicate and subject can be far from each other, 5) subjects can be unexpressed and 6) various semantic rules may apply. The task provides a straightforward and easily reproducible way of evaluating language models on a morphologically rich language.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
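The AGREE task (entry 257) asks a language model to choose the right past tense suffix. One way such an evaluation might be wired up is sketched below. Everything here is an assumption for illustration: the `{}` slot marker, the candidate list, the `(template, gold suffix)` example format, and the scoring callback are not taken from the dataset itself.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Common Czech past tense endings; the real dataset may mark suffixes differently.
SUFFIXES = ("l", "la", "lo", "li", "ly")
SLOT = "{}"  # hypothetical marker for the position of the suffix


def choose_suffix(
    template: str,
    score: Callable[[str], float],
    candidates: Sequence[str] = SUFFIXES,
) -> str:
    """Return the candidate whose filled-in sentence the model scores highest."""
    return max(candidates, key=lambda suffix: score(template.replace(SLOT, suffix)))


def accuracy(examples: Iterable[Tuple[str, str]], score: Callable[[str], float]) -> float:
    """Share of examples where the top-scoring suffix equals the gold suffix."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(choose_suffix(template, score) == gold for template, gold in examples)
    return correct / len(examples)


# Toy stand-in for a language model: it only "knows" two sentences.
known = {"Muž dělal oběd.", "Žena dělala oběd."}
toy_score = lambda sentence: 1.0 if sentence in known else 0.0
examples = [("Muž děla{} oběd.", "l"), ("Žena děla{} oběd.", "la")]
```

In practice `score` would return something like a sentence log-probability from the model under evaluation; ties are resolved in favour of the first candidate, which is a property of `max` rather than of the task.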
258. Czech HS Contracts Dataset (CHSC) 1.0
- Creator:
- Szabó, Adam and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- Czech, document classification, contracts, and Hlídač státu
- Language:
- Czech
- Description:
- Czech Contracts dataset was created as a part of the thesis Low-resource Text Classification (2021), A. Szabó, MFF UK. Contracts are obtained from the Hlídač Státu web portal. Labels in the development and training set are automatically classified on the basis of the keyword method according to the thesis Automatická klasifikace smluv pro portál HlidacSmluv.cz, J. Maroušek (2020), MFF UK. For this reason, the goal in the classification is not to achieve 100% on the development set, as the classification contains a certain amount of noise. The test set is manually annotated. The dataset contains a total of 97493 contracts.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/
259. Czech image captioning, machine translation, and sentiment analysis (Neural Monkey models)
- Creator:
- Libovický, Jindřich, Rosa, Rudolf, Helcl, Jindřich, and Popel, Martin
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- suiteOfTools and toolService
- Subject:
- sentiment analysis, machine translation, image captioning, neural networks, transformer, and Neural Monkey
- Language:
- Czech and English
- Description:
- This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving three NLP tasks: machine translation, image captioning, and sentiment analysis. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks. The models are described in the accompanying paper. The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd
  There are several separate ZIP archives here, each containing one model solving one of the tasks for one language. To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.
  Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization). The 'experiment.ini' file, which was used to train the model, is also included. Then there are files containing the model itself, files containing the input and output vocabularies, etc.
  For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/
  For the machine translation, you do not need to tokenize the data, as this is done by the model.
  For image captioning, you need to:
  - download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
  - clone the git repository with TensorFlow models: https://github.com/tensorflow/models
  - preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py); you need to specify the path to ResNet and to the TensorFlow models to this script
  Feel free to contact the authors of this submission in case you run into problems!
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
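The description of the Neural Monkey models (entry 259) asks users to check out the exact Neural Monkey commit named in each model directory's 'git_commit' file. A small helper for that step might look like the sketch below. It assumes the file holds the hash as its first whitespace-separated token, which the description does not state, and the function names are made up.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def read_pinned_commit(model_dir: PathLike) -> str:
    """Return the commit hash stored in <model_dir>/git_commit."""
    text = (Path(model_dir) / "git_commit").read_text(encoding="utf-8").strip()
    if not text:
        raise ValueError("git_commit file is empty")
    # Assumption: the hash is the first token; anything after it is ignored.
    return text.split()[0]


def checkout_command(model_dir: PathLike, repo_dir: PathLike) -> List[str]:
    """Build (but do not run) the git command that pins a local clone."""
    return ["git", "-C", str(repo_dir), "checkout", read_pinned_commit(model_dir)]
```

The returned list can be handed to `subprocess.run(..., check=True)` once a local clone of https://github.com/ufal/neuralmonkey exists at `repo_dir`; building the command separately keeps the helper easy to inspect before anything is executed.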
260. Czech image captioning, machine translation, sentiment analysis and summarization (Neural Monkey models)
- Creator:
- Libovický, Jindřich, Rosa, Rudolf, Helcl, Jindřich, and Popel, Martin
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- suiteOfTools and toolService
- Subject:
- sentiment analysis, machine translation, image captioning, neural networks, transformer, Neural Monkey, and summarization
- Language:
- Czech and English
- Description:
- This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving four NLP tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks. The models are described in the accompanying paper. The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd
  In addition to the models presented in the referenced paper (developed and published in 2018), we include models for automatic news summarization for Czech and English developed in 2019. The Czech models were trained using the SumeCzech dataset (https://www.aclweb.org/anthology/L18-1551.pdf), the English models were trained using the CNN-Daily Mail corpus (https://arxiv.org/pdf/1704.04368.pdf) using the standard recurrent sequence-to-sequence architecture.
  There are several separate ZIP archives here, each containing one model solving one of the tasks for one language. To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.
  Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization). The 'experiment.ini' file, which was used to train the model, is also included. Then there are files containing the model itself, files containing the input and output vocabularies, etc.
  For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/
  For the machine translation, you do not need to tokenize the data, as this is done by the model.
  For image captioning, you need to:
  - download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
  - clone the git repository with TensorFlow models: https://github.com/tensorflow/models
  - preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py); you need to specify the path to ResNet and to the TensorFlow models to this script
  The summarization models require input that is tokenized with Moses Tokenizer (https://github.com/alvations/sacremoses) and lower-cased.
  Feel free to contact the authors of this submission in case you run into problems!
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
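The summarization models in entry 260 expect input that is Moses-tokenized and lower-cased. The sketch below shows one way to prepare a line of text with the sacremoses package linked in the description. The function names and the order of operations (tokenize first, then lower-case) are assumptions; the tokenizer is passed in as a parameter so the helper can be tried without the third-party dependency installed.

```python
from typing import Callable


def moses_tokenize(text: str, lang: str = "en") -> str:
    """Tokenize with sacremoses (third-party: pip install sacremoses)."""
    from sacremoses import MosesTokenizer  # imported lazily on first use

    return MosesTokenizer(lang=lang).tokenize(text, return_str=True)


def prepare_for_summarization(
    text: str,
    tokenize: Callable[[str], str] = moses_tokenize,
) -> str:
    """Tokenize the text, then lower-case the space-separated result."""
    return tokenize(text).lower()
```

For Czech input, something like `prepare_for_summarization(text, tokenize=lambda s: moses_tokenize(s, lang="cs"))` would select the Czech tokenizer rules; whether the published models were trained with exactly these settings is not stated in the description.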
261. Czech Legal Text Treebank
- Creator:
- Kríž, Vincent, Hladká, Barbora, and Urešová, Zdeňka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- treebank, corpus, Czech, legal texts, and legal domain
- Language:
- Czech
- Description:
- The Czech Legal Text Treebank (CLTT) is a collection of 1133 manually annotated dependency trees. CLTT consists of two legal documents: The Accounting Act (563/1991 Coll., as amended) and Decree on Double-entry Accounting for undertakers (500/2002 Coll., as amended).
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
262. Czech Legal Text Treebank 2.0
- Creator:
- Kríž, Vincent and Hladká, Barbora
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- treebank, Prague dependencies, named entities, and semantic relations
- Language:
- Czech
- Description:
- The Czech Legal Text Treebank 2.0 (CLTT 2.0) annotates the same texts as the CLTT 1.0. These texts come from the legal domain and they are manually syntactically annotated. The CLTT 2.0 annotation on the syntactic layer is more elaborate than in the CLTT 1.0 in various respects. In addition, new annotation layers were added to the data: (i) the layer of accounting entities, and (ii) the layer of semantic entity relations.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
263. Czech Lexico-Semantic Database 0.1
- Creator:
- Tichy, Ondrej, Obstova, Zora, and Klegr, Ales
- Publisher:
- Charles University, Faculty of Arts
- Type:
- text, thesaurus, and lexicalConceptualResource
- Subject:
- onomasiological lexicography, thesaurus, lexico-semantic database, digitization, and Czech
- Language:
- Czech
- Description:
- A lexicographical project, whose aim is to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) in order to create an integrated digital multi-purpose lexico-semantic database of Czech.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
264. Czech Malach Cross-lingual Speech Retrieval Test Collection
- Creator:
- Galuščáková, Petra, Pecina, Pavel, Hoffmannová, Petra, Hajič, Jan, Ircing, Pavel, and Švec, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- annotated corpus, corpus, speech corpus, annotation, audio, and multilingual
- Language:
- Czech, English, French, German, and Spanish
- Description:
- The package contains Czech recordings of the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four types of automatic transcripts, manual annotations of selected topics and interviews' metadata. In total, the archive contains 353 recordings and 592 hours of interviews.
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
265. Czech Models (CNEC) for NameTag
- Creator:
- Straka, Milan and Straková, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, languageDescription, and mlmodel
- Subject:
- NameTag, Czech, and named entity recognition
- Language:
- Czech
- Description:
- Czech models for NameTag, providing recognition of named entities. The models are trained on Czech Named Entity Corpus 2.0 and 1.1. This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013). Czech models are trained on Czech Named Entity Corpus, which was created by Magda Ševčíková, Zdeněk Žabokrtský, Jana Straková and Milan Straka. The recognizer research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, 1ET101120503 of Academy of Sciences of the Czech Republic, LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013), and partially by SVV project number 267 314. The research was performed by Jana Straková, Zdeněk Žabokrtský and Milan Straka. Czech models use MorphoDiTa as a tagger and lemmatizer, therefore MorphoDiTa Acknowledgements (http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements) and Czech MorphoDiTa Model Acknowledgements (http://ufal.mff.cuni.cz/morphodita/users-manual#czech-morfflex-pdt_acknowledgements) apply.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
266. Czech Models (MorfFlex CZ + PDT) for MorphoDiTa
- Creator:
- Straka, Milan and Straková, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, languageDescription, and mlmodel
- Subject:
- MorphoDiTa, Czech, morphological analysis, morphological generation, and PoS tagging
- Language:
- Czech
- Description:
- Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ and the PoS tagger is trained on PDT (Prague Dependency Treebank). This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013). The Czech morphologic system was devised by Jan Hajič. The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová. The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník. The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta. The tagger is trained on the morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
267. Czech Models (MorfFlex CZ 160310 + PDT 3.0) for MorphoDiTa 160310
- Creator:
- Straka, Milan and Straková, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, languageDescription, and mlmodel
- Subject:
- MorphoDiTa, Czech, morphological analysis, morphological generation, and PoS tagging
- Language:
- Czech
- Description:
- Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ 160310 and the PoS tagger is trained on Prague Dependency Treebank 3.0 (PDT). This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013). The Czech morphologic system was devised by Jan Hajič. The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová. The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník. The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta. The tagger is trained on the morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
268. Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115
- Creator:
- Straka, Milan and Straková, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, languageDescription, and mlmodel
- Subject:
- MorphoDiTa, Czech, morphological analysis, morphological generation, and PoS tagging
- Language:
- Czech
- Description:
- Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ 161115 and DeriNet 1.2, and the PoS tagger is trained on Prague Dependency Treebank 3.0 (PDT). This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013). The Czech morphologic system was devised by Jan Hajič. The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová. The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník. The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta. The tagger is trained on the morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
269. Czech Models (MorfFlex CZ 2.0 + PDT-C 1.0) for MorphoDiTa 220710
- Creator:
- Straka, Milan and Straková, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, languageDescription, and mlmodel
- Subject:
- MorphoDiTa, Czech, morphological analysis, morphological generation, and PoS tagging
- Language:
- Czech
- Description:
- Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ 2.0 and DeriNet 2.1, and the PoS tagger is trained on Prague Dependency Treebank - Consolidated 1.0. This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013). The Czech morphologic system was devised by Jan Hajič. The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová. The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník. The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta. The tagger is trained on the morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
270. Czech Models for Korektor 2
- Creator:
- Richter, Michal
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, languageDescription, and mlmodel
- Subject:
- Korektor, Czech, spellchecker, spellchecking, and diacritical marks generation
- Language:
- Czech
- Description:
- The Czech models for Korektor 2 created by Michal Richter, 02 Feb 2013. The models can either perform spellchecking and grammar checking, or only generate diacritical marks. This work was created by Michal Richter as an extension of his diploma thesis Advanced Czech Spellchecker. The models utilize the MorfFlex CZ dictionary (http://hdl.handle.net/11858/00-097C-0000-0015-A780-9) created by Jan Hajič and Jaroslava Hlaváčová.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
271. Czech Multiword Expressions
- Creator:
- Nevěřilová, Zuzana
- Publisher:
- Faculty of Informatics, Masaryk University
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- multiword expressions
- Language:
- Czech
- Description:
- The dataset contains 4731 frozen continuous Czech multiword expressions. Inflectional word forms are generated for those MWEs where applicable. In total, the dataset contains 24,807 MWE forms.
- Rights:
- Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB
272. Czech Named Entity Corpus 1.0
- Creator:
- Ševčíková, Magda, Žabokrtský, Zdeněk, and Straková, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- named entity recognition, named entity corpus, Czech, NER, and corpus
- Language:
- Czech
- Description:
- The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification. Funding: 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů).
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
273. Czech Named Entity Corpus 1.1
- Creator:
- Ševčíková, Magda, Žabokrtský, Zdeněk, Straková, Jana, and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- named entity recognition and corpus
- Language:
- Czech
- Description:
- Czech Named Entity Corpus 1.1 fixes some issues of the Czech Named Entity Corpus 1.0: misannotated entities are fixed, all formats contain the same data, the tmt format is replaced with the treex format, and all formats contain the split into training, development and testing portions of the data. Funding: SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK).
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
274. Czech Named Entity Corpus 2.0
- Creator:
- Ševčíková, Magda, Žabokrtský, Zdeněk, Straková, Jana, and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- named entity recognition
- Language:
- Czech
- Description:
- Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with 35220 manually annotated Czech named entities, classified according to a two-level hierarchy of 46 named entities. Funding: SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK).
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
275. Czech OOV Inflection Dataset
- Creator:
- Sourada, Tomáš
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, computationalLexicon, and lexicalConceptualResource
- Subject:
- morphological generation, morphology, neologisms database, and Czech
- Language:
- Czech
- Description:
- Czech OOV Inflection Dataset is a Czech inflection dataset of nouns, focused on evaluation in out-of-vocabulary (OOV) conditions. It consists of two parts: a standard lemma-disjoint train-dev-test split of a subset of noun paradigms of the existing morphological dictionary Czech MorfFlex 2.0 (files train, dev and test-MorfFlex); and a small set of neologisms from Čeština 2.0, annotated for inflected forms (file test-neologisms).
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
276. Czech Parliament Meetings
- Creator:
- Pražák, Aleš and Šmídl, Luboš
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- audio and corpus
- Subject:
- speech corpus, acoustic model, speaker identification, and speaker verification
- Language:
- Czech
- Description:
- The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently consists of 88 hours of speech data, which corresponds roughly to 0.5 million tokens. The annotation process is semi-automatic, as we are able to perform speech recognition on the data with high accuracy (over 90%) and consequently align the resulting automatic transcripts with the speech. The annotator's task is then to check the transcripts, correct errors, add proper punctuation and label speech sections with information about the speaker. The resulting corpus is therefore suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net). The date of airing of a particular recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning on the day following the actual Parliament session. If a recording is too long to fit in the broadcasting scheme, it is divided into several parts and aired on consecutive days.
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
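The SOUND_YYMMDD_* naming convention described above can be decoded programmatically. The following Python sketch is only an illustration: the function names and the example filename are hypothetical, and it assumes Python's default two-digit-year pivot (00-68 is read as 20xx, 69-99 as 19xx). Because the description says recordings are "usually" aired the day after the session, the session date it returns is a guess, not a guarantee.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

# Matches the airing date in names of the form SOUND_YYMMDD_*.
_SOUND_NAME = re.compile(r"^SOUND_(\d{6})_")


def airing_date(filename: str) -> Optional[date]:
    """Return the airing date encoded in the filename, or None if absent/invalid."""
    match = _SOUND_NAME.match(filename)
    if match is None:
        return None
    try:
        # %y uses Python's pivot: 00-68 -> 2000-2068, 69-99 -> 1969-1999 (assumption).
        return datetime.strptime(match.group(1), "%y%m%d").date()
    except ValueError:
        return None  # six digits that do not form a valid calendar date


def likely_session_date(filename: str) -> Optional[date]:
    """Guess the session date as the day before airing (the usual case only)."""
    aired = airing_date(filename)
    return None if aired is None else aired - timedelta(days=1)


# Hypothetical filename, used only to demonstrate the parsing.
example = "SOUND_120301_part1.wav"
print(airing_date(example), likely_session_date(example))
```

When a long recording is split and aired over consecutive days, only the first part is likely to satisfy the day-before assumption, so later parts would need to be grouped by other means.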
277. Czech PDT-C 1.0 Model for UDPipe 2 (2023-11-16)
- Creator:
- Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- tokenizer, POS tagger, lemmatization, parser, dependency parser, MorfFlex CZ 2.0, and PDT-C 1.0
- Language:
- Czech
- Description:
- Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 1.0 treebank (https://hdl.handle.net/11234/1-3185). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc1.0_model . To use these models, you need UDPipe version 2.1, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
278. Czech Relationship Extraction Dataset
- Creator:
- Šimečková, Zuzana and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- entity relationship and relationship extraction
- Language:
- Czech
- Description:
- CERED (Czech Relationship Dataset) is a family of datasets created via distant supervision on Czech Wikipedia and Wikidata. It was created as part of a thesis on Relationship Extraction (2020). CERED0 is the largest dataset; it lacks a negative relation and its relation inventory is huge. CERED*n* is a subset of CERED*n-1* that satisfies some conditions. The methodology of curating the datasets is detailed in the thesis. The data are in jsonL format, and the tools used to generate the dataset are written in Python.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
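Since the CERED data are distributed as jsonL (one JSON object per line), a generic reader is enough to start exploring them. The sketch below makes no assumption about the field names inside each record, which are not listed in this description; the function name and the sample records are hypothetical and serve only to show the mechanics.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"line {line_number}: invalid JSON") from error


# Hypothetical records; the real CERED field names may differ.
sample = io.StringIO('{"id": 1}\n\n{"id": 2}\n')
records = list(read_jsonl(sample))
print(len(records), sorted(records[0].keys()))
```

For a real file, the same function can be applied to `open(path, encoding="utf-8")`; inspecting the keys of the first few records is a reasonable first step before relying on any particular field.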
279. Czech restaurant information dataset for NLG
- Creator:
- Dušek, Ondřej, Jurčíček, Filip, Dvořák, Josef, Grycová, Petra, Hejda, Matěj, Olivová, Jana, Starý, Michal, and Štichová, Eva
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- natural language generation, dialogue system, and morphological generation
- Language:
- Czech
- Description:
- This is a dataset for natural language generation (NLG) in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015). It includes input dialogue acts and the corresponding output natural language paraphrases in Czech. Since the dataset is intended for recurrent neural network based NLG systems using delexicalization, inflection tables for all slot values appearing verbatim in the text are provided.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
280. Czech RST Discourse Treebank 1.0
- Creator:
- Poláková, Lucie, Zikánová, Šárka, Mírovský, Jiří, and Hajičová, Eva
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- discourse, discourse annotation, and annotated corpus
- Language:
- Czech
- Description:
- The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0) is a dataset of 54 Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST). Each text document in the treebank is represented as a single tree-like structure whose nodes (discourse units) are interconnected through hierarchical rhetorical relations. The dataset also contains concurrent annotations of five double-annotated documents. The original texts are a part of the data annotated in the Prague Dependency Treebank, although the two projects are independent.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
281. Czech Senior COMPANION Expressive Speech Corpus
- Creator:
- Grůber, Martin
- Publisher:
- University of West Bohemia
- Type:
- audio and corpus
- Subject:
- speech corpus, expressive, and text-to-speech synthesis
- Language:
- Czech
- Description:
- The corpus contains Czech expressive speech recorded using a scenario-based approach by a professional female speaker. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors. Funding: European Commission Sixth Framework Programme, Information Society Technologies, Integrated Project IST-34434.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
282. Czech Sociological Review 1993-2016
- Creator:
- Hladik, Radim
- Publisher:
- Institute of Philosophy of the Czech Academy of Sciences
- Type:
- text and corpus
- Subject:
- sociology, academic writing, scholarly writing, and journal
- Language:
- Czech
- Description:
- Selected research articles and essays published in Czech Sociological Review from 1993 to 2016. Originally Czech, non-translated material only. 522 documents in total.
- Rights:
- The MIT License (MIT), http://opensource.org/licenses/mit-license.php, and PUB
283. Czech SubLex 1.0
- Creator:
- Veselovská, Kateřina and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicalConceptualResource, and wordList
- Subject:
- subjectivity lexicon, sentiment analysis, opinion mining, and polarity clues
- Language:
- Czech
- Description:
- Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part-of-speech tags, polarity orientation and source information. The core of the Czech subjectivity lexicon was obtained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used the parallel corpus CzEng 1.0, containing 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep layers of syntactic representation. Afterwards, the lexicon was manually refined by an experienced annotator. The work on this project has been supported by the GAUK 3537/2011 grant and by SVV project number 267 314.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
284. Czech Television News Broadcasting Faces
- Creator:
- Hrúz, Marek
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- video and corpus
- Subject:
- video, czech news broadcasting, faces, and face tracking
- Language:
- Czech
- Description:
- The corpus contains video files of Czech Television News Broadcasts and JSON files with annotations of faces that appear in the broadcasts. The annotations are composed of the frames in which a face is seen, the name of the person whose face is seen, the gender of the person (male/female), and the image region containing the face. The intended use of the corpus is to train models of faces for face detection, face identification, face verification, and face tracking. For convenience, two different JSON files are provided. They contain the same data, but in different arrangements. One file has the identity of the person at the top level, the other has the object ID at the top level, where the object is a facetrack. A demo Python script is available showing how to access the data.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
285. Czech Text Document Corpus v 2.0
- Creator:
- Král, Pavel and Lenc, Ladislav
- Publisher:
- European Language Resources Association (ELRA)
- Type:
- text and corpus
- Subject:
- corpus, Czech, document classification, multi-label, and text
- Language:
- Czech
- Description:
- Basic information: Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes. This corpus was created in order to facilitate a straightforward comparison of document classification approaches on Czech data. It is particularly dedicated to the evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer. The main part (for training and testing) is composed of 11,955 real newspaper articles. We also provide a development set, which is intended to be used for tuning the hyper-parameters of the created models. This set contains 2735 additional articles. The total number of categories is 60, out of which the 37 most frequent ones are used for classification. The reason for this reduction is to keep only the classes with a sufficient number of occurrences to train the models. Technical details: Text documents are stored in individual text files using UTF-8 encoding. Each filename is composed of the serial number and the list of category abbreviations, separated by the underscore symbol, followed by the .txt suffix. Serial numbers are composed of five digits and the numerical series starts from the value one. For instance, the file 00046_kul_nab_mag.txt represents document number 46, annotated with the categories kul (culture), nab (religion) and mag (magazine selection). The content of the document, i.e. the word tokens, is stored in one line. The tokens are separated by space symbols. Every text document was further automatically morphologically analyzed. This analysis includes lemmatization, POS tagging and syntactic parsing. The fully annotated files are stored in .conll files. We also provide the lemmatized form (files with the suffix .lemma) and the corresponding POS tags (see the .pos files). The tokenized version of the documents is also available in .tok files. This corpus is available for research purposes only, free of charge. Commercial use in any form is strictly excluded.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
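The filename convention described above (a five-digit serial number followed by underscore-separated category abbreviations and the .txt suffix) lends itself to a small parser. The Python sketch below is a minimal illustration; the function name is hypothetical, and it treats any name that does not follow the described pattern as an error rather than guessing.

```python
from pathlib import Path
from typing import List, Tuple


def parse_document_name(filename: str) -> Tuple[int, List[str]]:
    """Split a corpus filename into (serial number, list of category codes)."""
    name = Path(filename).name  # drop any directory component
    if not name.endswith(".txt"):
        raise ValueError(f"expected a .txt file, got {name!r}")
    serial, *categories = name[: -len(".txt")].split("_")
    if len(serial) != 5 or not serial.isdigit():
        raise ValueError(f"expected a five-digit serial number in {name!r}")
    return int(serial), categories


# The example filename given in the corpus description.
print(parse_document_name("00046_kul_nab_mag.txt"))
```

Returning the categories as a list keeps the multi-label nature of the corpus explicit, so the result can be passed directly to a multi-label encoder of the reader's choice.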
286. Czech Translation of SQuAD 2.0 and 1.1
- Creator:
- Macková, Kateřina and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- SQuAD and reading comprehension
- Language:
- Czech
- Description:
- The Czech translation of SQuAD 2.0 and SQuAD 1.1 datasets contains automatically translated texts, questions and answers from the training set and the development set of the respective datasets. The test set is missing, because it is not publicly available. The data is released under the CC BY-NC-SA 4.0 license. If you use the dataset, please cite the following paper (the exact format was not available during the submission of the dataset): Kateřina Macková and Straka Milan: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer, presented at TSD 2020, Brno, Czech Republic, September 8-11 2020.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
287. Czech translation of the EBUContentGenre thesaurus
- Creator:
- Ircing, Pavel
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- text, lexicalConceptualResource, and thesaurus
- Subject:
- thesaurus, metadata annotation, and topic detection
- Language:
- Czech and English
- Description:
- The EBUContentGenre is a thesaurus containing the hierarchical description of various genres utilized in the TV broadcasting industry. This thesaurus is a part of a complex metadata specification called EBUCore, intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of the European Broadcasting Union, the largest professional association of broadcasters around the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and consequent development of systems for automatic cataloguing (topic/genre detection). Funding: Technology Agency of the Czech Republic, project No. TA01011264.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
288. Czech Verbal MWEs
- Creator:
- Bejček, Eduard
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- lexicon, verbs, multiword expressions, forms, and lemmatization
- Language:
- Czech
- Description:
- Lexicon of Czech verbal multiword expressions (VMWEs) used in the Parseme Shared Task 2017 (https://typo.uni-konstanz.de/parseme/index.php/2-general/142-parseme-shared-task-on-automatic-detection-of-verbal-mwes). The lexicon consists of 4785 VMWEs, categorized into four categories according to the Parseme Shared Task (PST) typology: IReflV (inherently reflexive verbs), LVC (light verb constructions), ID (idiomatic expressions) and OTH (other VMWEs with other than verbal syntactic head). Verbal multiword expressions as well as deverbative variants of VMWEs were annotated during the preparation phase of PST. These data were published as http://hdl.handle.net/11372/LRT-2282. The Czech part includes 14,536 VMWE occurrences: 1611 ID, 10000 IReflV, 2923 LVC and 2 OTH. This lexicon was created out of the Czech data. Each lexicon entry is represented by one line in the form: type lemmas frequency PoS [used form 1; used form 2; ... ] (columns are separated by tabs), where: type is the type of VMWE in the PST typology; lemmas are the space-separated lemmatized forms of all words that constitute the VMWE; frequency is the absolute frequency of this item in the PST data; PoS is a space-separated list of parts of speech of the individual words (in the same order as in "lemmas"); the final field contains a list of all (1 to 18) used forms found in the data (since Czech is an inflectional language).
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
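The tab-separated entry format described above can be read with a few lines of Python. This is a sketch under stated assumptions: the class and function names are hypothetical, the square brackets around the final field are assumed to appear literally in the file (they are stripped only if present), and the sample line, including its PoS tags, is invented for illustration rather than taken from the lexicon.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmweEntry:
    vmwe_type: str      # PST category: IReflV, LVC, ID or OTH
    lemmas: List[str]   # lemmatized words of the expression
    frequency: int      # absolute frequency in the PST data
    pos: List[str]      # parts of speech, same order as lemmas
    forms: List[str]    # surface forms found in the data


def parse_entry(line: str) -> VmweEntry:
    """Parse one tab-separated lexicon line into a VmweEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    vmwe_type, lemmas, frequency, pos, forms = fields
    forms = forms.strip()
    # Assumption: brackets may be literal in the file; remove them if so.
    if forms.startswith("[") and forms.endswith("]"):
        forms = forms[1:-1]
    return VmweEntry(
        vmwe_type=vmwe_type,
        lemmas=lemmas.split(),
        frequency=int(frequency),
        pos=pos.split(),
        forms=[form.strip() for form in forms.split(";") if form.strip()],
    )


# Invented sample line; real entries and tag names may look different.
entry = parse_entry("IReflV\tbát se\t12\tVERB PRON\t[bál se; bojí se]")
print(entry.lemmas, entry.frequency, entry.forms)
```

Keeping lemmas and PoS tags as parallel lists mirrors the description's note that both fields follow the same word order.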
289. Czech WordNet 1.9 PDT
- Creator:
- Pala, Karel, Čapek, Tomáš, Zajíčková, Barbora, Bartůšková, Dita, Kulková, Kateřina, Hoffmannová, Petra, Bejček, Eduard, Straňák, Pavel, and Hajič, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- ontology, wordnet, and Czech WordNet
- Language:
- Czech
- Description:
- A slightly modified version of the Czech WordNet. This is the version used to annotate "The Lexico-Semantic Annotation of PDT using Czech WordNet" (http://hdl.handle.net/11858/00-097C-0000-0001-487A-4). The Czech WordNet was developed by the Centre of Natural Language Processing at the Faculty of Informatics, Masaryk University, Czech Republic. The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 23,094 word senses (synsets). 203 of these were created or modified by UFAL during correction of annotations. A more recent version of Czech WordNet is distributed by ELRA (http://catalog.elra.info/product_info.php?products_id=1089). Funding: 1ET201120505, LM2010013.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
290. Czech Youth Digging Trenches
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- akce Kuratorium pro výchovu mládeže, Kuratorium pro výchovu mládeže akce, mládež pracující, motyky, rýče, lopaty, zákopy hloubení, Kuratorium, People::Teuner František (1911-1978), and Český zvukový týdeník Aktualita::1945/4AB
- Language:
- Czech
- Description:
- Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 4A, B from 1945 shows how, in early 1945, Czech agricultural youth were involved in digging trenches as a part of their forced labour (Totaleinsatz). Their work was supervised by instructors of the Board of Trustees for the Education of Youth. General Secretary of the Board František Teuner arrived to inspect their progress.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
291. Czech Youth Helping with the Harvest
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- brigádníci na žních, brigáda zemědělská, žně, vozy tažené voly, občerstvení na poli, práce polní, mlátička obilí, akce Kuratorium pro výchovu mládeže, Kuratorium pro výchovu mládeže akce, Kuratorium, and Český zvukový týdeník Aktualita::1943/32
- Language:
- Czech
- Description:
- Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 32B from 1943 was shot during an event organised by the Board of Trustees for the Education of Youth in the summer of 1943. Czech youth helped with harvesting as part of their mandatory service.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
292. Czech-English Manual Word Alignment
- Creator:
- Mareček, David
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- word alignment and parallel corpus
- Language:
- Czech and English
- Description:
- Corpus of manually aligned Czech-English parallel sentences. It comprises 2500 parallel sentences from 7 different sources.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
293. Czech-English Parallel Corpus 1.0 (CzEng 1.0)
- Creator:
- Bojar, Ondřej, Žabokrtský, Zdeněk, Dušek, Ondřej, Galuščáková, Petra, Majliš, Martin, Mareček, David, Maršík, Jiří, Novák, Michal, Popel, Martin, and Tamchyna, Aleš
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, parallel corpus, treebank, and alignment
- Language:
- Czech and English
- Description:
- CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), freely available for non-commercial research purposes. CzEng 1.0 contains 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep (a- and t-) layers of syntactic representation. Funding: EuroMatrix Plus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic), Faust (FP7-ICT-2009-4-247762 of the EU and 7E11041 of the Ministry of Education, Youth and Sports of the Czech Republic), GAČR P406/10/P259, GAUK 116310, GAUK 4226/2011.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
294. Czech-Slovak Parallel Corpus
- Creator:
- Galuščáková, Petra, Garabík, Radovan, and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parallel corpus and Czech-Slovak corpus
- Language:
- Slovak and Czech
- Description:
- Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and a downloaded website of the European Commission [5]. The corpus is published both in plaintext format and with automatic morphological annotation. References: [1] http://langtech.jrc.it/JRC-Acquis.html/ [2] http://www.statmt.org/europarl/ [3] http://apertium.eu/data [4] http://opus.lingfil.uu.se/ [5] http://ec.europa.eu/ This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
295. CzeDLex 0.5
- Creator:
- Mírovský, Jiří, Synková, Pavlína, Rysová, Magdaléna, and Poláková, Lucie
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- lexicon and discourse annotation
- Language:
- Czech
- Description:
- CzeDLex 0.5 is a pilot version of a lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0 (PDiT 2.0), a large corpus annotated manually with discourse relations. The most frequent entries in the lexicon (covering more than 2/3 of the discourse relations annotated in the PDiT 2.0) have been manually checked, translated to English and supplemented with additional linguistic information.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
296. CzeDLex 0.6
- Creator:
- Synková, Pavlína, Poláková, Lucie, Mírovský, Jiří, and Rysová, Magdaléna
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- lexicon and discourse annotation
- Language:
- Czech
- Description:
- CzeDLex 0.6 is the second development version of the lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0 (PDiT 2.0), a large corpus annotated manually with discourse relations. The most frequent entries in the lexicon (76 out of total 204 entries, covering more than 90% of the discourse relations annotated in PDiT 2.0), have been manually checked, translated to English and supplemented with additional linguistic information.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
297. CzeDLex 0.7
- Creator:
- Poláková, Lucie, Mírovský, Jiří, Synková, Pavlína, Kloudová, Věra, and Rysová, Magdaléna
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- lexicon and discourse annotation
- Language:
- Czech
- Description:
- CzeDLex 0.7 is the third development version of the Lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0 (PDiT 2.0) and, as a supplementary resource, the Czech part of the Prague Czech–English Dependency Treebank with discourse annotation projected from the Penn Discourse Treebank 3.0. The most frequent entries in the lexicon (131 out of total 218 entries, covering more than 95% of discourse relations annotated in PDiT 2.0), have been manually checked, translated to English and supplemented with additional linguistic information.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
298. CzeDLex 1.0
- Creator:
- Mírovský, Jiří, Synková, Pavlína, Poláková, Lucie, Kloudová, Věra, and Rysová, Magdaléna
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- lexicon and discourse
- Language:
- Czech
- Description:
- CzeDLex 1.0 is the first production version (the fourth development version) of the Lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from resources annotated manually with discourse relations: the Prague Discourse Treebank 2.0 (PDiT 2.0) as the primary resource, plus two supplementary resources, namely (i) the Czech part of the Prague Czech–English Dependency Treebank with discourse annotation projected from the Penn Discourse Treebank 3.0, and (ii) a thousand sentences selected from various novels and transcriptions of public speeches. All 200 entries in the lexicon have been manually checked, translated into English and supplemented with additional linguistic information.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
299. CzEng 0.7
- Creator:
- Bojar, Ondřej, Žabokrtský, Zdeněk, Češka, Pavel, Beňa, Peter, and Janíček, Miroslav
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parallel corpus
- Language:
- Czech and English
- Description:
- CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to provide the data needed for Czech-English and English-Czech machine translation research. CzEng 0.7 consists of a large set of parallel textual documents, mainly from the fields of European law, information technology, and fiction, all converted into a uniform XML-based file format and provided with automatic sentence alignment.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
300. CzEngClass 0.1
- Creator:
- Urešová, Zdeňka, Fučíková, Eva, Hajičová, Eva, and Hajič, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- verbal valency, predicate argument structure, semantic roles, bilingual corpus annotation, translational equivalence, comparative syntax, and comparative semantics
- Language:
- English and Czech
- Description:
- The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/). Part of the dataset is a file recording the annotators' choices in assigning verbs to classes.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB