A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Relationship extraction models for the Czech language. The models are trained on CERED (a dataset created by distant supervision on Czech Wikipedia and Wikidata) and recognize a subset of Wikidata relations (listed in CEREDx.LABELS).
We supply demo.py, which performs inference on user-defined input, and a requirements.txt file for pip. Adapt the demo code to use the models in your own setting.
Both the dataset and the models are presented in the Relationship Extraction thesis.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 33 shows children playing in gas masks in Prague.
The segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel), 1944, issue no. 1A was shot during a Christmas exhibition organised by the Board of Trustees for the Education of Youth and held in the hall of the Black Rose Palace in Na Příkopě Street in Prague from 18 to 22 December. The exhibition included a display of the 500 prettiest toys made as part of the Sewing Dolls initiative. Girls made 59,000 dolls, of which 44,000 went to the children of labourers working in the Reich and 15,000 to the children of German soldiers fighting on the front. The exhibition was toured by Emanuel Moravec, Minister of Education and People's Enlightenment and Chairman of the Board, and František Teuner, General Secretary of the Board.
Transcripts of longitudinal audio recordings of 7 typically developing monolingual Czech children aged 1;7 to 3;9. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the pseudonym of the child and the child's age at the given session in the form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
A new version of the previously published corpus Chroma. Version 2023.04 includes six children. Two transcripts (Julie20221, Klara30424) were removed because they did not meet the criteria for dialogical format. The transcripts were revised (eliminating typing errors and inconsistencies in the transcription format) and morphologically annotated with the automatic tool MorphoDiTa. The annotation of children's utterances was checked manually in detail; the annotation of adult data has not yet been checked. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in the form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
A new version of the previously published corpus Chroma with morphological annotation. Version 2023.07 differs from 2023.04 in that it includes all seven children and has undergone an additional careful check of consistency and conformity to the CHAT transcription principles.
Two transcripts (Julie20221, Klara30424) from the previous versions (2022.07, 2019.07) were removed because they did not meet our criteria for dialogical format. All transcripts of recordings made during one day were merged into one file. Thus, version 2023.07 consists of 183 files/transcripts. The utterance and token counts given here in LINDAT cover children's lines only.
Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
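The naming convention described above (alias followed by the age as YMMDD) can be decoded mechanically. The following is a minimal sketch in Python, not part of the corpus distribution: the names `SessionId` and `parse_session_name` are hypothetical, the input is assumed to be the file name without any extension, and the alias is assumed to consist of letters only. The age is rendered as years;months.days, which is the usual CHAT notation.

```python
import re
from typing import NamedTuple


class SessionId(NamedTuple):
    """Alias and age decoded from a transcript name (hypothetical helper type)."""
    alias: str
    years: int
    months: int
    days: int

    def chat_age(self) -> str:
        # Render the age as years;months.days, e.g. 2;02.21
        return f"{self.years};{self.months:02d}.{self.days:02d}"


# Assumption: letters-only alias followed by exactly five digits (Y, MM, DD).
_NAME = re.compile(r"^(?P<alias>[^\W\d_]+)(?P<y>\d)(?P<mm>\d{2})(?P<dd>\d{2})$")


def parse_session_name(stem: str) -> SessionId:
    """Split a name such as 'Julie20221' into alias and age components."""
    match = _NAME.match(stem)
    if match is None:
        raise ValueError(f"not in alias+YMMDD form: {stem!r}")
    return SessionId(
        alias=match["alias"],
        years=int(match["y"]),
        months=int(match["mm"]),
        days=int(match["dd"]),
    )


if __name__ == "__main__":
    # Names taken from the description above, used only to illustrate the format.
    for name in ("Julie20221", "Klara30424"):
        session = parse_session_name(name)
        print(name, session.alias, session.chat_age())
```

Under these assumptions, Julie20221 decodes to the alias Julie at age 2;02.21 and Klara30424 to Klara at age 3;04.24. If the distributed files carry an extension, strip it (for example with `pathlib.Path(name).stem`) before parsing.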
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 39 appeals to the public to donate blood in preparation for the expected military conflict. It includes illustrative shots of how donated blood is preserved for use in the combat environment. The report includes information about different blood groups and how healthy blood donors are tested by the Czechoslovak Red Cross.