The valency lexicon PDT-Vallex 4.0 has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT, the spoken language corpus (PDTSC) and corpus of user-generated texts in the project Faust). It contains over 14500 valency frames for almost 8500 verbs which occurred in the PDT, PCEDT, PDTSC and Faust corpora. In addition, there are nouns, adjectives and adverbs, linked from the PDT part only, increasing the total to over 17000 valency frames for 13000 words. All the corpora have been published in 2020 as the PDT-C 1.0 corpus with the PDT-Vallex 4.0 dictionary included; this is a copy of the dictionary published as a separate item for those not interested in the corpora themselves. It is available in electronically processable format (XML), and also in more human readable form including corpus examples (see the WEBSITE link below, and the links to its main publications elsewhere in this metadata). The main feature of the lexicon is its linking to the annotated corpora - each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives. It replaces the previously published unversioned edition of PDT-Vallex from 2014.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 27 reports on the 16th International Congress of PEN. Clubs held in Prague from 26 to 30 June 1938. The delegates include President Edvard Beneš and his wife Hana, the writers Karel Čapek, H. G. Wells and Olga Scheinpflugová, and Vojtěch Mastný, the Czechoslovak envoy to Berlin.
Poet Petr Bezruč with a group of unidentified people. A view of his house in Ostravice. Bezruč among mining apprentices in the Mining Vocational School Residence Hall. Footage from segments of Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1958, issue no. 9, and Týden ve filmu (Week in Film) 1945, issue no. 23.
Statistical component of Chimera, a state-of-the-art MT system. and Project DF12P01OVV022 of the Ministry of Culture of the Czech Republic (NAKI -- Amalach).
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.
The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).
For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].
The script which can be used to get new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded fast).
Also, the format of the dumps changes time to time, so the script will probably eventually stop working one day.
The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine, I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
System for querying annotated treebanks in PML format. The querying uses it own query language with graphical representation. It has two different implementations (SQL and Perl) and several clients (TrEd, browser-based, command line interface).
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 42B shows the town of Polička before the German annexation on 10 October 1938. The footage shows a munitions factory, the main square with the town hall, a cemetery, and the removal of chronicles from the local archive on the square. Images of street traffic are combined with shots showing Czech refugees