Language: Czech / Rights: PUB - LINDAT/CLARIAH-CZ Catalog Search Results

1. A Gift of an Ambulance Train to the German Army

Creator:: Aktualita
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: vlak sanitní, projev veřejný, automobil, nádraží slavnostně vyzdobené, kříže hákové, busta Hitler Adolf, důstojníci němečtí, vojáci němečtí nastoupení, orlice říšskoněmecká na vlaku, vlak sanitní část lůžková, vlak sanitní ošetřovna, vlak sanitní kuchyně, Heydrichiáda, Places::Praha::Nové Město::Hlavní nádraží, People::Hácha Emil (1872-1945), People::Heydrich Reinhard (1904-1942), People::Krejčí Jaroslav (1892-1956), People::Moravec Emanuel (1893-1945), People::Frank Karl Hermann (1898-1946), and Český zvukový týdeník Aktualita::1942/17
Language:: Czech
Description:: Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 17, captures the presentation of a gift, Ambulance Train no. 751, from the Protectorate of Bohemia and Moravia to Adolf Hitler and the German army. The train handover took place at Prague Main Railway Station on 20 April 1942, the birthday of Adolf Hitler. Cars arrive in front of Prague Main Railway Station. Acting Reich Protector Reinhard Heydrich enters the train station. State President Emil Hácha gives a speech in the festively decorated railway hall. In response, Heydrich shakes his hand. The event is witnessed by a delegation of railway workers. The train crew lines up on the station platform. Heydrich enters the train with his entourage and inspects the sleeping cars, the operating carriage, the kitchen, and the sick bay. The inspection of the ambulance train is attended by Protectorate Prime Minister Jaroslav Krejčí and Minister of Education and People's Enlightenment Emanuel Moravec. According to the voiceover, the train was made in a railway workshop in Prague-Bubny in record time. It consisted of 28 carriages and 20 hospital carriages, was 410 metres long, weighed 545 tons and had capacity for 280 wounded.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

Creator:: Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: NER, named entity recognition, and Medieval
Language:: Czech, English, German, and Latin
Description:: This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

3. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)

Creator:: Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: NER, named entity recognition, and Medieval
Language:: Czech, English, German, and Latin
Description:: This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents

Creator:: Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: image and corpus
Subject:: ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
Language:: German, Czech, Latin, and English
Description:: This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

5. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials

Creator:: Novotný, Vít and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
Language:: Czech, English, German, and Latin
Description:: These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

6. A Manifestation for Reinhard Heydrich at the ND

Creator:: Aktualita
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Heydrichiáda, tryzna Heydrich Reinhard, divadlo interiér, lóže divadelní, orlice říšskoněmecká, znak zemský Čechy, znak zemský Morava, busta Heydrich Reinhard, projevy veřejné, lidé tleskající, hajlování, lidé hajlující, manifestace divadelníků, Národní divadlo, Places::Praha::Nové Město::Národní divadlo /int./, People::Krejčí Jaroslav (1892-1956), People::Deyl Rudolf st. (1876-1972), People::Moravec Emanuel (1893-1945), People::Nasková Růžena (1884-1960), People::Höger Karel (1909-1977), People::Futurista Ferenc (1891-1947), People::Neumann Stanislav (1902-1975), People::Nový Oldřich (1899-1983), People::Šejbalová Jiřina (1905-1981), People::Baldová Zdenka (1885-1958), People::Průcha Jaroslav (1898-1963), and Český zvukový týdeník Aktualita::1942/27A
Language:: Czech
Description:: Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) 1942, issue no. 27A, captures the Pledge of Czech Theatre Professionals' Allegiance to the Reich, a manifestation held at the National Theatre in Prague on 25 June 1942, which was to unequivocally condemn the assassination of Acting Reich Protector Reinhard Heydrich. Speeches are delivered by actor Rudolf Deyl Sr. and Minister of Education and People's Enlightenment Emanuel Moravec (silent). Actress Růžena Nasková and actors Karel Höger, Ferenc Futurista, and Stanislav Neumann are seen among the participants. The segment concludes with everyone performing the Nazi salute.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

7. A Small Dataset for English-to-Czech Speech Translation in the Travel Domain

Creator:: Cífka, Ondřej and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: speech corpus, ASR, and machine translation
Language:: English and Czech
Description:: This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate). The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied. The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

8. Acting Reich Protector Reinhard Heydrich

Creator:: Aktualita
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: vojáci SS, jednotka SS nastoupená, vlajka s hákovým křížem, protektor říšský, kapela vojenská německá, automobil přijíždějící, Heydrichiáda, Places::Praha::Hradčany::Pražský hrad, Places::Praha::Hradčany::Pražský hrad::první hradní nádvoří, People::Hácha Emil (1872-1945), and Český zvukový týdeník Aktualita::1941/40
Language:: Czech
Description:: Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1941, issue no. 40, captures events linked to the accession of SS-Obergruppenführer Reinhard Heydrich to the office of Deputy Reich Protector of the Protectorate of Bohemia and Moravia on 27 September 1941. Heydrich attends an SS military parade on Hradčanské Square in Prague. Military dignitaries and state officials welcome him in the first quadrangle of Prague Castle. The Nazi flag flies over Prague Castle. Reich Commissioner for the Sudetenland Konrad Henlein and Reich Secretary Karl Hermann Frank are present at the occasion. State President Emil Hácha receives Reinhard Heydrich at Prague Castle.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

9. Additional German-Czech reference translations of the WMT'11 test set

Creator:: Bojar, Ondřej, Zeman, Daniel, Dušek, Ondřej, Břečková, Jana, Farkačová, Hana, Grošpic, Pavel, Kačenová, Kristýna, Knechtová, Eva, Koubová, Anna, Lukavská, Jana, Nováková, Petra, and Petrdlíková, Jana
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: reference translation, German-Czech, and parallel corpus
Language:: German and Czech
Description:: Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved. This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

13. AKCES 2

Creator:: Šebesta, Karel and Goláňová, Hana
Publisher:: Charles University in Prague, ÚČJTK
Type:: text and corpus
Subject:: youth language, classroom, language acquisition corpus, and AKCES
Language:: Czech
Description:: Corpus AKCES 2 consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Supported by MŠMT (MSM0021620825) and UK (PRVOUK P 10).
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

14. AKCES 2 ver. 2

Creator:: Šebesta, Karel and Goláňová, Hana
Publisher:: Charles University in Prague, ÚČJTK
Type:: text and corpus
Subject:: youth language, classroom, language acquisition corpus, and AKCES
Language:: Czech
Description:: Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Supported by UK (PRVOUK P10).
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

15. AKCES 3

Creator:: Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Rosen, Alexandr, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Šťastný, Klement, Sládek, Šimon, and Pierscieniak, Piotr
Publisher:: Charles University in Prague, ÚČJTK
Type:: text and corpus
Subject:: Czech as a foreign language, Czech language acquisition corpora, non-native speakers, AKCES, and second language acquisition
Language:: Czech
Description:: Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora). Supported by ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), and UK (P10).
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

16. AKCES 4

Creator:: Šebesta, Karel, Bedřichová, Zuzanna, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Rosen, Alexandr, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Šťastný, Klement, and Sládek, Šimon
Publisher:: Charles University
Type:: text and corpus
Subject:: language of children, Czech language acquisition, adolescents, and AKCES
Language:: Czech
Description:: Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora). Supported by ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), and UK (P10).
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

17. AKCES 5 (CzeSL-SGT)

Creator:: Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Richter, Michal, Straka, Milan, and Rosen, Alexandr
Publisher:: Charles University
Type:: text and corpus
Subject:: learner corpus, Czech as a foreign language, Czech language acquisition corpora, AKCES, non-native speakers, and second language acquisition
Language:: Czech
Description:: Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text.
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

18. AKCES 5 (CzeSL-SGT) Release 2

Creator:: Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Richter, Michal, Straka, Milan, and Rosen, Alexandr
Publisher:: Charles University
Type:: text and corpus
Subject:: learner corpus, Czech as a foreign language, Czech language acquisition corpora, AKCES, non-native speakers, and second language acquisition
Language:: Czech
Description:: Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text. In addition to a few minor bugs, this release fixes a critical issue in Release 1: the native speakers of Ukrainian (s_L1:"uk") were wrongly labelled as speakers of "other European languages" (s_L1_group="IE"), instead of speakers of a Slavic language (s_L1_group="S"). The file is now a regular XML document, with all annotation represented as XML attributes.
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

19. AKCES-GEC Grammatical Error Correction Dataset for Czech

Creator:: Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Straka, Milan, Rosen, Alexandr, Náplava, Jakub, and Poláčková, Marie
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: natural language correction, grammatical error correction, and gec
Language:: Czech
Description:: AKCES-GEC is a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that, in comparison to the CZESL-GEC dataset, this dataset contains separated edits together with their type annotations in M2 format and also has two times more sentences. If you use this dataset, please use the following citation: @article{naplava2019wnut, title={Grammatical Error Correction in Low-Resource Scenarios}, author={N{\'a}plava, Jakub and Straka, Milan}, journal={arXiv preprint arXiv:1910.00353}, year={2019} }
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

20. Alfréd Hořice (ornithologist)

Creator:: Aktualita and Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: narozeniny Hořice Alfréd, ptáci vycpaní, vitrina s vycpanými ptáky, Galerie osobností, People::Hořice Alfréd (1865-1945), and Český zvukový týdeník Aktualita::1945/19
Language:: Czech
Description:: Ornithologist Alfréd Hořice with his collection of stuffed birds in a fragmented segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1945, issue no. 19.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

21. Alfréd Nikodém (cold water swimmer)

Creator:: Aktualita and Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: závody plavecké, řeka, diváci na závodech plaveckých, mosty pražské, kytice pro plavce, Galerie osobností, Places::Praha::řeka Vltava, Places::Praha::Karlův most, Places::Praha::Čechův most, Places::řeka Vltava, People::Nikodém Alfred (1864-1949), and Český zvukový týdeník Aktualita::1942/36
Language:: Czech
Description:: Cold water swimmer Alfréd Nikodém as the oldest participant in a swimming race in the Vltava River in a segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 36. Nikodém with a bouquet of flowers by Svatopluk Čech Bridge.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

22. Alois Hába (composer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: nástroj hudební klavír and Galerie osobností
Language:: Czech
Description:: Composer Alois Hába with an unidentified young woman on Bohumil Veselý's balcony, and playing the piano in a fragmented segment from a film newsreel.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

23. Alois Jalovec (cinema pioneer)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Galerie osobností, People::Jalovec Alois (1867-1932), and People::Jalovcová (neuvedeno-)
Language:: Czech
Description:: Cinema pioneer Alois Jalovec with his wife and children in family photographs.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

24. Alois Klíma (conductor)

Creator:: Krátký film and Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: sál koncertní, orchestr Symfonický orchestr Českého rozhlasu, dirigent symfonického orchestru, Pražské jaro 1952, akce Pražské jaro, Galerie osobností, Places::Praha::Staré Město::náměstí Republiky::Obecní dům::Smetanova síň, People::Klíma Alois (1905-1980), People::Kofránek Ladislav Jan (1880-1954), and Československé filmové noviny 1952/24
Language:: Czech
Description:: Conductor Alois Klíma conducts the Prague Radio Symphony Orchestra at the Prague Spring International Music Festival in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 24. The orchestra performs Bedřich Smetana's symphonic poem Tábor.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

25. Alois Schneiderka (painter)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: ateliér malířský, malíř při práci, obrazy Schneiderka Alois, paleta malířská, dýmka, Galerie osobností, Places::Soláň::dům a ateliér Aloise Schneiderky /int./, Places::hora Soláň::dům a ateliér Aloise Schneiderky /int./, and People::Schneiderka Alois (1896-1958)
Language:: Czech
Description:: Painter Alois Schneiderka working in his studio.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

26. American Slovaks Meeting President Beneš

Creator:: Aktualita
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: Slovenská liga v Americe, Pittsburská dohoda, vztahy ČSR-Slovenská Liga v Americe, Mnichovská dohoda, People::Beneš Edvard (1884-1948), People::Hletko Peter Pavol (1902-1973), People::Hušek Jozef (1880-1947), People::Novák Andrej (neuvedeno-), People::Rolík Andrej (neuvedeno-), People::Sloboda Dominik (neuvedeno-), and Český zvukový týdeník Aktualita::1938/23
Language:: Czech
Description:: The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 23 captures Peter Hletko's speech on the importance of the Pittsburgh Agreement and continues with a report on the meeting between the five-member delegation of the American Slovak League, led by Peter Hletko, and President Edvard Beneš, which was held in Prague on 30 May 1938.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

27. Anna Roškotová (painter)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: ateliér malířský, obrazy Roškotová Anna, stojan malířský, malířka při práci, Galerie osobností, and People::Roškotová Anna (1883-1967)
Language:: Czech
Description:: Painter Anna Roškotová in her studio.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

28. Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.0)

Creator:: Savary, Agata, Ramisch, Carlos, Cordeiro, Silvio Ricardo, Sangati, Federico, Vincze, Veronika, QasemiZadeh, Behrang, Candito, Marie, Cap, Fabienne, Giouli, Voula, Stoyanova, Ivelina, Doucet, Antoine, Adalı, Kübra, Barbu Mititelu, Verginica, Bejček, Eduard, El Maarouf, Ismail, Eryiğit, Gülşen, Galea, Luke, Ha-Cohen Kerner, Yaakov, Liebeskind, Chaya, Monti, Johanna, Parra Escartín, Carla, Kovalevskaitė, Jolanta, Krek, Simon, van der Plas, Lonneke, Aceta, Cristina, Aduriz, Itziar, Antoine, Jean-Yves, Attard, Greta, Azzopardi, Kirsty, Boizou, Loic, Bonnici, Janice, Boz, Mert, Bumbulienė, Ieva, Busuttil, Jael, Caruso, Valeria, Cherchi, Manuela, Constant, Matthieu, Czerepowicka, Monika, De Santis, Anna, Dimitrova, Tsvetana, Dinç, Tutkum, Elyovich, Hevi, Fabri, Ray, Farrugia, Alison, Findlay, Jamie, Fotopoulou, Aggeliki, Foufi, Vassiliki, Galea, Sara Anne, Gantar, Polona, Gatt, Albert, Gatt, Anabelle, Herrero, Carlos, Iñurrieta, Uxoa, Jagfeld, Glorianna, Hnátková, Milena, Ionescu, Mihaela, Klyueva, Natalia, Koeva, Svetla, Kovács, Viktória, Kuzman, Taja, Leseva, Svetlozara, Louisou, Sevi, Lynn, Teresa, Malka, Ruth, Martínez Alonso, Héctor, McCrae, John, de Medeiros Caseli, Helena, Miral, Ayşenur, Muscat, Amanda, Nivre, Joakim, Oakes, Michael, Onofrei, Mihaela, Parmentier, Yannick, Pasquer, Caroline, Pia di Buono, Maria, Priego Sanchez, Belem, Raffone, Annalisa, Ramisch, Renata, Rimkutė, Erika, Rizea, Monica-Mihaela, Simkó, Katalin, Spagnol, Michael, Stefanova, Valentina, Stymne, Sara, Sulubacak, Umut, Tabone, Nicole, Tanti, Marc, Todorova, Maria, Urešová, Zdenka, Villavicencio, Aline, and Zilio, Leonardo
Publisher:: PARSEME
Type:: text and corpus
Subject:: Multiword expressions, verbal multiword expressions, idioms, light-verb constructions, verb-particle constructions, and inherently reflexive verbs
Language:: Bulgarian, Czech, German, Modern Greek (1453-), Spanish, Persian, French, Hebrew, Hungarian, Italian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovenian, Swedish, and Turkish
Description:: The PARSEME shared task aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to suicide' in French). VMWEs were annotated according to the universal guidelines in 18 languages. The corpora are provided in the parsemetsv format, inspired by the CONLL-U format. For most languages, paired files in the CONLL-U format - not necessarily using UD tagsets - containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training and test data, tools and the universal guidelines file.
Rights:: PARSEME Shared Task Data (v. 1.0) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.0, and PUB

29. Annotated Corpus of Czech Case Law for Reference Recognition Tasks

Creator:: Harašta, Jakub, Šavelka, Jaromír, Kasl, František, Kotková, Adéla, Loutocký, Pavel, Míšek, Jakub, Procházková, Daniela, Pullmannová, Helena, Semenišín, Petr, Šejnová, Tamara, Šimková, Nikola, Vosinek, Michal, Zavadilová, Lucie, and Zibner, Jan
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: reference recognition and legal texts
Language:: Czech
Description:: Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Every decision is annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between annotators. Adjudication was conducted non-destructively, so the dataset contains all original annotations. The corpus was developed as training and testing material for reference recognition tasks. The dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document etc.) and values (polarity, depth of discussion etc.).
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

30. Annotated Corpus of Czech Case Law for Reference Recognition Tasks (2019-06-25)

Creator:: Harašta, Jakub, Šavelka, Jaromír, Kasl, František, Kotková, Adéla, Loutocký, Pavel, Míšek, Jakub, Procházková, Daniela, Pullmannová, Helena, Semenišín, Petr, Šejnová, Tamara, Šimková, Nikola, Vosinek, Michal, Zavadilová, Lucie, and Zibner, Jan
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: reference recognition and legal texts
Language:: Czech
Description:: Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Every decision is annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between annotators. Adjudication was conducted non-destructively, so the corpus (raw) contains all original annotations. The corpus was developed as training and testing material for reference recognition tasks. The dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document etc.) and values (polarity, depth of discussion etc.).
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

31. Annotated Corpus of Czech Case Law for Segmentation Tasks

Creator:: Harašta, Jakub, Šavelka, Jaromír, Kasl, František, and Míšek, Jakub
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: document segmentation and legal texts
Language:: Czech
Description:: Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). 280 decisions were annotated by one trained annotator and then manually adjudicated by one trained curator. 70 decisions were annotated by two trained annotators and then manually adjudicated by one trained curator. Adjudication was conducted destructively, so the dataset contains only the correct annotations and does not contain all original annotations. The corpus was developed as training and testing material for text segmentation tasks. The dataset contains decisions segmented into Header, Procedural History, Submission/Rejoinder, Court Argumentation, Footer, Footnotes, and Dissenting Opinion. Segmentation allows different parts of the text to be treated differently even if they contain similar linguistic or other features.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

32. Annotation of Dramatic Situations in Theater Play Scripts

Creator:: Mareček, David, Nováková, Marie, Vosecká, Klára, Doležal, Josef, and Rosa, Rudolf
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and The Academy of Performing Arts in Prague, Theatre Faculty (DAMU)
Type:: text and corpus
Subject:: theatre, play script, and dramatic situation
Language:: Czech
Description:: We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated a further 33 play scripts. In this version of the data, we release only the play scripts that can be freely distributed, which is 9 play scripts. One play is annotated independently by three annotators.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

33. Antonín Martin Brousil (vice-chancellor of Prague's Academy)

Creator:: Krátký film and Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: řetěz rektorský, cena MFF Karlovy Vary, festival filmový MFF karlovy Vary, Galerie osobností, People::Revueltas Rosaura (1910-1996), People::Brousil Antonín Martin (1907-1986), and People::Plicka Karel (1894-1986)
Language:: Czech
Description:: Antonín Martin Brousil, the vice-chancellor of Prague's Academy of Performing Arts, and Mexican actress Rosaura Revueltas at the 1954 Karlovy Vary International Film Festival in a fragmented segment from the weekly film newsreel.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

34. Antonín Pelc (painter)

Creator:: Krátký film and Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: ateliér malířský, narozeniny Pelc Antonín 60., Galerie osobností, People::Pelc Antonín (1895-1967), People::Záhořová Jarmila (1924-1958), and Československé filmové noviny 1952/43
Language:: Czech
Description:: Painter Antonín Pelc with his wife Jarmila Záhořová in the studio in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 43. The painter in his studio on the day of his 60th birthday in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1955, issue no. 4.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

35. Antonín Přecechtěl (otolaryngologist)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: otorinolaryngologie, lékař při práci, Galerie osobností, Places::Praha::Klinika nemocí ušních::ústních a hrtanových, and People::Přecechtěl Antonín (1885-1971)
Language:: Czech
Description:: Professor and otolaryngologist Antonín Přecechtěl working at the clinic.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

36. Artificial Treebank with Ellipsis

Creator:: Droganova, Kira, Zeman, Daniel, Kanerva, Jenna, and Ginter, Filip
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: universal dependencies, ellipsis, and gapping
Language:: English, Czech, Finnish, Russian, and Slovak
Description:: Artificially created treebank of elliptical constructions (gapping), in the annotation style of Universal Dependencies. The data are taken from the UD 2.1 release and from large web corpora parsed by two parsers. The input data are filtered to identify sentences where gapping could be applied; those sentences are then transformed by omitting one or more words, resulting in a sentence with gapping. Details in Droganova et al.: Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions, LREC 2018, Miyazaki, Japan.
Rights:: Licence Universal Dependencies v2.1, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.1, and PUB

37. Aspect-Term Annotated Customer Reviews in Czech

Creator:: Fiala, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: sentiment analysis, opinion target, and customer review
Language:: Czech
Description:: This dataset contains user product reviews that are publicly available on the website of an established Czech online shop selling electronic devices. Each review consists of negative and positive aspects of the product. This setting pushes the customer to rate important characteristics. We have selected 2000 positive and negative segments from these reviews and manually tagged their targets. Additionally, we selected 200 of the longest reviews and annotated them in the same way. The targets were either aspects of the evaluated product or some general attributes (e.g. price, ease of use).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

38. AudioPSP 24.01: Audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic

Creator:: Kopp, Matyáš
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: Parliament of the Czech Republic
Language:: Czech
Description:: This record contains audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic. The recordings have been provided by the official websites of the Chamber of Deputies, and the set contains them in their original format with no further processing. Recordings cover all available audio files from 2013-11-25 to 2023-07-26. Audio files are packed by year (2013-2023) and quarter (Q1-Q4) in tar archives audioPSP-YYYY-QN.tar. Furthermore, there are two TSV files: audioPSP-meta.quarterArchive.tsv contains metadata about archives, and audioPSP-meta.audioFile.tsv contains metadata about individual audio files.
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
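
The archive naming scheme described for AudioPSP 24.01 (audioPSP-YYYY-QN.tar) can be sketched as follows. This is only an illustration: the function name quarter_archives is hypothetical, and the sketch assumes that every quarter between the first and last recording dates has an archive, which the record does not state explicitly.

```python
from datetime import date


def quarter_archives(first: date, last: date) -> list[str]:
    """List archive names audioPSP-YYYY-QN.tar for every quarter in [first, last].

    Assumption: one archive exists per calendar quarter in the covered range.
    """
    names = []
    year, quarter = first.year, (first.month - 1) // 3 + 1
    end = (last.year, (last.month - 1) // 3 + 1)
    while (year, quarter) <= end:
        names.append(f"audioPSP-{year}-Q{quarter}.tar")
        # Move to the next quarter, rolling over to the next year after Q4.
        year, quarter = (year + 1, 1) if quarter == 4 else (year, quarter + 1)
    return names


# Date range taken from the record description.
archives = quarter_archives(date(2013, 11, 25), date(2023, 7, 26))
print(archives[0], archives[-1], len(archives))
```

The authoritative list of archives is the metadata file audioPSP-meta.quarterArchive.tsv shipped with the record; the generated names should be checked against it before downloading.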

39. Automatic Paraphrases of Czech Reference Sentences for WMT11, 13 and 14

Creator:: Barančíková, Petra and Tamchyna, Aleš
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, automatic evaluation, and paraphrasing
Language:: Czech
Description:: This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014. For each sentence, at most 10000 paraphrases were included (randomly selected from the full set). The goal of using this dataset is to improve automatic evaluation of machine translation outputs. If you use this work, please cite the following paper: Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In proceedings of LREC, 2016.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

40. Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)

Creator:: Hajič, Jan, Náplava, Jakub, and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: spelling correction and natural language correction
Language:: Czech
Description:: Automatically generated spelling correction corpus for Czech (Czesl-SEC-AG) is a corpus containing text with automatically generated spelling errors. To create the spelling errors, a character error model is used that contains probabilities of character substitution, insertion, and deletion, and probabilities of swapping two adjacent characters. In addition, probabilities of changing character casing are considered. The original clean text on which the spelling errors were generated is PDT3.0 (http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3). The original train/dev/test sentence split of the PDT3.0 corpus is preserved in this dataset. Besides the data with artificial spelling errors, we also publish the texts from which the character error model was created. These are the original manual transcript of an audiobook of Švejk and its corrected version produced by the authors of Korektor (http://ufal.mff.cuni.cz/korektor). Like the CzeSL Grammatical Error Correction Dataset (CzeSL-GEC: http://hdl.handle.net/11234/1-2143), these data are processed into four sets based on the difficulty of the errors present.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
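
The kind of character error model described for the spelling correction corpus above (substitution, insertion, deletion, swapping of adjacent characters, and case changes) can be illustrated with a minimal sketch. It is not the authors' implementation: the function corrupt, the single global probability per operation, the uniform choice of replacement characters, and the alphabet are all assumptions made for illustration. The real model is estimated from the Švejk transcript and its corrected version.

```python
import random

# Hypothetical alphabet used for substituted and inserted characters.
CZECH_LOWER = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"


def corrupt(text, rng, p_sub=0.0, p_ins=0.0, p_del=0.0, p_swap=0.0,
            p_case=0.0, alphabet=CZECH_LOWER):
    """Inject artificial spelling errors into text, one character at a time.

    At each position at most one of deletion, swap, substitution, or
    insertion is applied; a case change is sampled independently.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        r = rng.random()
        if r < p_del:
            # Deletion: skip the character entirely.
            i += 1
            continue
        if r < p_del + p_swap:
            # Swap with the following character, if there is one.
            if i + 1 < n:
                out.extend((text[i + 1], ch))
                i += 2
                continue
        elif r < p_del + p_swap + p_sub:
            # Substitution: replace with a random character.
            ch = rng.choice(alphabet)
        elif r < p_del + p_swap + p_sub + p_ins:
            # Insertion: add a random character before the current one.
            out.append(rng.choice(alphabet))
        if rng.random() < p_case:
            ch = ch.swapcase()
        out.append(ch)
        i += 1
    return "".join(out)


rng = random.Random(0)  # fixed seed so a run is reproducible
print(corrupt("Dobrý voják Švejk", rng, p_sub=0.05, p_ins=0.05,
              p_del=0.05, p_swap=0.05, p_case=0.02))
```

Passing a seeded random.Random instance keeps the generated errors reproducible, which matters when a noisy text has to stay aligned with a fixed train/dev/test split.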

41. Bohumil Kafka (sculptor)

Creator:: Pečený and Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: sochař při práci, ateliér sochařský, lev v kleci, lev jako model, zahrada zoologická, pomník Štefánik Milan Rastislav, pomník Mánes Josef, pomník Mánes Josef odhalení, projev při odhalení pomníku, Galerie osobností, Places::Praha::Troja::zoologická zahrada, Places::Praha::Alšovo nábřeží::pomník Josefa Mánesa, Places::Praha::Dejvice::ateliér Bohumila Kafky, Places::Praha::Staré Město::Palachovo náměstí::Rudolfinum, People::Kafka Bohumil (1878-1942), People::Hodža Milan (1878-1944), People::Nechleba Vratislav (1885-1965), and Československý filmový týdeník 1937/5
Language:: Czech and No linguistic content
Description:: Sculptor Bohumil Kafka works on a statue of Josef Mánes in a fragmented segment from the Ufa žurnál (Ufa Journal) 1939, issue no. 200. The unveiling of the monument by the Rudolfinum, including a speech by Professor Vratislav Nechleba, in a fragmented segment from Československé filmové noviny (Czechoslovak Film News) 1951, issue no. 52. Kafka at Prague Zoo working on a study of a lion for the Milan Rastislav Štefánik monument in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1937, issue no. 5. Kafka with politician Milan Hodža in the artist's studio in Prague-Dejvice.
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

42. BushBank

Creator:: Grác, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: interannotator agreement, corpus, chunks, phrases, and clauses
Language:: Czech
Description:: Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

43. C4Corpus (CC BY-NC part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

44. C4Corpus (CC BY-NC-ND part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

45. C4Corpus (CC BY-NC-SA part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

46. C4Corpus (CC BY-ND part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0), http://creativecommons.org/licenses/by-nd/4.0/, and PUB

47. C4Corpus (CC BY-SA part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

48. C4Corpus (CC-BY part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

49. C4Corpus (publicdomain part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Russian, Slovenian, Somali, Spanish, Swahili (macrolanguage), Swedish, Tagalog, Thai, Turkish, Ukrainian, Undetermined, and Vietnamese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB

50. CERED baseline models

Creator:: Šimečková, Zuzana and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: mlmodel, text, and languageDescription
Subject:: relationship extraction
Language:: Czech
Description:: Relationship extraction models for the Czech language. The models are trained on CERED (a dataset created by distant supervision on Czech Wikipedia and Wikidata) and recognize a subset of Wikidata relations (listed in CEREDx.LABELS). We supply a demo.py script that performs inference on user-defined input, and a requirements.txt file for pip. Adapt the demo code to use the models. Both the dataset and the models are presented in the Relationship Extraction thesis.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/
