dc.contributor.author | Rosa, Rudolf |
dc.contributor.author | Zouhar, Vilém |
dc.date.accessioned | 2022-11-11T16:09:44Z |
dc.date.available | 2022-11-11T16:09:44Z |
dc.date.issued | 2022-11-11 |
dc.identifier.uri | http://hdl.handle.net/11234/1-4922 |
dc.description | This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague. For each publication record, the authors are obliged to provide both the original abstract (in Czech or English) and its translation (into English or Czech) in the internal Biblio system. The data was filtered to remove duplicates and records with missing entries, so that every record is bilingual. Additionally, records of published papers that are indexed by Semantic Scholar contain the respective link. The dataset was created from a September 2022 image of the Biblio database and is stored in JSONL format, with each line corresponding to one record. |
dc.language.iso | ces |
dc.language.iso | eng |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation.replaces | http://hdl.handle.net/11234/1-1731 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ |
dc.source.uri | https://github.com/ufal/bilingual-abstracts-corpus |
dc.subject | parallel corpus |
dc.subject | scientific texts |
dc.subject | abstracts |
dc.title | Czech and English abstracts of ÚFAL papers (2022-11-11) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Rudolf Rosa rosa@ufal.mff.cuni.cz Charles University in Prague, UFAL |
contact.person | Vilém Zouhar vilem.zouhar@gmail.com ETH Zürich, Department of Computer Science |
sponsor | Grantová agentura Univerzity Karlovy v Praze GAUK 15723/2014 Modelování závislostní syntaxe napříč jazyky nationalFunds |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky LM2018101 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds |
size.info | 2659 entries |
size.info | 11000 sentences |
size.info | 255000 words |
files.size | 3818008 |
files.count | 1 |
Files in this item
License category:
License: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Publicly Available
- Name: corpus.jsonl
- Size: 3.64 MB
- Format: Unknown
- Description: The corpus
- MD5: 666b8f01db3671c4db8a298ff3b8eee7
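
Since the record gives an MD5 checksum and describes the file as JSONL with one record per line, a download can be checked and loaded roughly as follows. This is a minimal sketch, not code from the corpus repository: the function names (`md5_of_file`, `read_jsonl`, `load_corpus`) are hypothetical, the local path `corpus.jsonl` is assumed to be where the file was saved, and no assumptions are made about the field names inside each record.

```python
import hashlib
import json

# Checksum copied from the file listing in this record.
EXPECTED_MD5 = "666b8f01db3671c4db8a298ff3b8eee7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_corpus(path="corpus.jsonl"):
    """Verify the checksum (hypothetical helper), then return all records."""
    actual = md5_of_file(path)
    if actual != EXPECTED_MD5:
        raise ValueError(f"MD5 mismatch: got {actual}")
    return list(read_jsonl(path))
```

If the download is intact, `len(load_corpus())` should correspond to the 2659 entries reported under size.info, assuming that figure counts lines of the file.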