« Previous |
11 - 17 of 17
|
Next »
Number of results to display per page
Search Results
12. Naše Sedlčany
- Type:
- model:article and TEXT
- Language:
- Czech
- Rights:
- http://creativecommons.org/publicdomain/zero/1.0/ and policy:public
13. ParaCrawl Corpus version 1.0
- Creator:
- Koehn, Philipp, Heafield, Kenneth, Forcada, Mikel L., Esplà-Gomis, Miquel, Ortiz-Rojas, Sergio, Sánchez, Gema Ramírez, Cartagena, Víctor M. Sánchez, Haddow, Barry, Bañón, Marta, Střelec, Marek, Samiotou, Anna, and Kamran, Amir
- Publisher:
- ParaCrawl
- Type:
- text and corpus
- Subject:
- ParaCrawl, parallel corpus, CommonCrawl, machine translation, and text corpora
- Language:
- English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Finnish, Latvian, Russian, and Estonian
- Description:
- The January 2018 release of the ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of web sites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and raw data download please visit: http://paracrawl.eu/releases.html
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
14. ParCzech 3.0
- Creator:
- Kopp, Matyáš, Stankov, Vladislav, Bojar, Ondřej, Hladká, Barbora, and Straňák, Pavel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- Parliament of the Czech Republic, Chamber of Deputies, stenographic protocols, TEI encoding, and speech corpus
- Language:
- Czech
- Description:
- The ParCzech 3.0 corpus is the third version of ParCzech consisting of stenographic protocols that record the Chamber of Deputies’ meetings held in the 7th term (2013-2017) and the current 8th term (2017-Mar 2021). The protocols are provided in their original HTML format, Parla-CLARIN TEI format, and the format suitable for Automatic Speech Recognition. The corpus is automatically enriched with the morphological, syntactic, and named-entity annotations using the procedures UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
15. ParCzech 4.0
- Creator:
- Kopp, Matyáš
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- Parliament of the Czech Republic, Chamber of Deputies, stenographic protocols, TEI encoding, and speech corpus
- Language:
- Czech
- Description:
- The ParCzech 4.0 corpus consists of stenographic protocols that record the Chamber of Deputies' meetings in the 7th term (2013-2017), the 8th term (2017-2021) and the current 9th term (2021-Jul 2023). The protocols are provided in their original HTML format, Parla-CLARIN TEI format. The corpus is automatically enriched with the morphological, syntactic, and named-entity annotations using the procedures UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files. The audio files in this corpus are available in AudioPSP 24.01 corpus (http://hdl.handle.net/11234/1-5404). This corpus covers the same period as ParlaMint-CZ corpus v4.0 (http://hdl.handle.net/11356/1860). ParCzech corpus follows and extends the ParlaMint schema. Both annotated and non-annotated versions include hypertext references to voting and parliamentary prints. In addition to ParlaMint's recommendation, the annotated version contains source audio alignment, PDT xtag, and more detailed CNEC2.0 named entity categorization.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
16. ParCzech PS7 1.0
- Creator:
- Hladká, Barbora, Kopp, Matyáš, and Straňák, Pavel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- Parliament of the Czech Republic, Chamber of Deputies, stenographic protocols, TEI encoding, and TEITOK
- Language:
- Czech
- Description:
- The ParCzech PS7 1.0 corpus is the very first member of the corpus family of data coming from the Parliament of the Czech Republic. ParCzech PS7 1.0 consists of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between 2013-2017. The audio recordings are available as well. Transcripts are provided in the original HTML as harvested, and also converted into TEI-derived XML format for use in TEITOK corpus manager. The corpus is automatically enriched with the morphological and named-entity annotations using the procedures MorphoDita and NameTag.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
17. ParCzech PS7 2.0
- Creator:
- Hladká, Barbora, Kopp, Matyáš, and Straňák, Pavel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- Parliament of the Czech Republic, Chamber of Deputies, stenographic protocols, TEI encoding, and TEITOK
- Language:
- Czech
- Description:
- The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between 2013-2017. The protocols are provided in their original HTML format, TEI format and TEI-derived format to make them searchable in the TEITOK corpus manager. Their audio recordings are available as well. The corpus is automatically enriched with the morphological, syntactic, and named-entity annotations using the procedures UDPipe 2 and NameTag 2.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB