=======
KUK 0.0
=======

Authors
=======
Barbora Hladká (Charles University, Faculty of Mathematics and Physics),
Silvie Cinková (Charles University, Faculty of Mathematics and Physics),
Michal Kuk (Frank Bold Society),
Jiří Mírovský (Charles University, Faculty of Mathematics and Physics),
Tereza Novotná (Charles University, Faculty of Mathematics and Physics),
Kristýna Nguyen Zahálková (Frank Bold Society)

Introduction
============
KUK 0.0 is a pilot version of a corpus of Czech legal and administrative texts for automatic assessment of accessibility (comprehensibility or clarity) of Czech legal texts.

KUK 0.0 contains:
- data and meta data from/for three sources:
  - Public materials of Frank Bold Society (FrBo)
  - Statements of the Public Defender of Rights (ESO - Evidence stanovisek ombudsmana)
  - Information flyers by the Public Defender of Rights (OmbuFlyers)
- meta data for two external corpora:
  - Czech Court Decisions Corpus (CzCDC 1.0, http://hdl.handle.net/11372/LRT-3052)
  - Corpus of Paraphrased Czech Administrative Texts with Reading Comprehension for Readability Studies (LiFR-Law, http://hdl.handle.net/11234/1-5225)

Data
====
Documents can be found in directory data/ and are organized in subdirectories according to their origin and format:

FrBo/ - Frank Bold texts - are further divided into the following subdirectories:
  articles/
    DOC/ - documents in their original format (DOC or DOCX)
    TXT/ - documents from directory DOC/ transformed to TXT format
  analyses/
    PDF/ - documents in their original format (PDF)
    TXT/ - documents from directory PDF/ transformed to TXT format
ESO/ - Statements of the Public Defender of Rights
  HTML/ - documents in their original format (HTML)
  TXT/ - documents from directory HTML/ transformed to TXT format
OmbuFlyers/ - Flyers from the Ombudsman’s Office’s web pages
  originals/ - outdated versions of the flyers
    PDF_DOC/ - documents in their original format (PDF or DOCX)
    TXT/ - documents from directory PDF_DOC/ transformed to TXT format
  redesigns/ - current versions of the flyers
    DOC_PDF/ - documents in their original format (DOCX or PDF)
    TXT/ - documents from directory DOC_PDF/ transformed to TXT format

Meta Data
=========
Meta data for documents of all five corpora (FrBo, ESO, OmbuFlyers and the external CzCDC 1.0 and LiFR-Law) are placed in directory metadata/.

The meta data are stored in four types of tables (tab-separated-values files):
- DocumentIdentificationGenreProperties
- DocumentFileFormat
- DocumentVersion
- ContentLinks

For each of the five corpora, there is a separate quadruple of tables, identifiable by the acronym of the corpus in the file names; e.g., for the Flyers from the Ombudsman’s Office’s web pages (OmbuFlyers), there are four files representing the meta data:
- OmbuFlyers_DocumentIdentificationGenreProperties.tsv
- OmbuFlyers_DocumentFileFormat.tsv
- OmbuFlyers_DocumentVersion.tsv
- OmbuFlyers_ContentLinks.tsv

ESO and FrBo have additional source-specific meta data (provided by the authors of the texts and not fitting into the common meta data schema) in the following files:
- ESOSpecificColumns.tsv
- FrkBoAnalysesSpecificColumns.tsv
- FrBoArticlesSpecificColumns.tsv

Meta Data Tables Description
============================

DocumentIdentificationGenreProperties
  - KUK_ID - a collection-wide unique identifier of a (version of a) document
  - SourceDB - the source of the document (a link whenever publicly available online)
  - SourceID - id in the source database
  - DocumentTitle - a source-specific document title
  - ClarityPursuit - TRUE if written with special care for quality (FALSE otherwise)
  - Anonymized - is the document anonymized?
    ("Anonymized by source", "On-site anonymization", "No")
  - RecipientType - the type of the recipient the document was designated to ("natural person", "legal person", "combined")
  - RecipientIndividuation - expected familiarity of the recipient with the matter ("individual", "bulk", "public")
  - AuthorType - the source type of the document ("authority", "individual")
  - Objectivity - objectivity of the text ("quasiobjective", "persuasive")
  - LegalActType - the type of the legal act ("individual", "normative")
  - Bindingness - is the document legally binding? (TRUE, FALSE)

DocumentFileFormat
  - KUK_ID - KUK id
  - FileFormat - the original file format of the document
  - FileName - the file name (without the format suffix)
  - Folder Path - the path of the document directory

DocumentVersion
  - KUK_ID - KUK id
  - Version - version of the document in its revision history ("Original", "Translation", "Partial Redesign", "Redesign"); each version has its own KUK id
  - CreationDate - the creation date in yyyy-mm-dd format
  - SourceOriginalID - KUK id of the first version of the document in its revision history

ContentLinks
  - KUK_ID - KUK id
  - RefersTo - KUK id of a document that the given document refers to

For a more detailed description of the meta data, see https://ufal.mff.cuni.cz/grants/ponk/kuk

Citation
========
Please cite the data when using the corpus for your research:

Barbora Hladká, Silvie Cinková, Michal Kuk, Jiří Mírovský, Tereza Novotná and Kristýna Nguyen Zahálková: KUK 0.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, 2023, LINDAT http://hdl.handle.net/11234/1-5363.

Licence
=======
The corpus KUK 0.0 is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.

For more information and updates, see https://ufal.mff.cuni.cz/grants/ponk/kuk

Acknowledgement
===============
The work on the corpus was financed by the TAČR SIGMA project TQ01000526: PONK - Asistent přístupné úřední komunikace.
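
Example: Reading the Meta Data
==============================
The four meta data tables of one corpus share the KUK_ID column, so they can be combined into one record per document. The following Python sketch is not part of the corpus distribution; it only illustrates one possible way to do this. It assumes that every .tsv file is UTF-8 encoded and starts with a header row carrying the column names listed in the table descriptions, and that the first three tables hold one row per KUK_ID. The function names read_tsv, load_corpus_metadata and load_content_links are hypothetical helpers introduced here.

```python
import csv
from pathlib import Path

# Tables assumed to hold exactly one row per KUK_ID (an assumption, not
# something the corpus documentation guarantees).
TABLES = (
    "DocumentIdentificationGenreProperties",
    "DocumentFileFormat",
    "DocumentVersion",
)


def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def load_corpus_metadata(metadata_dir, corpus):
    """Merge the per-document tables of one corpus by KUK_ID.

    File names follow the pattern <corpus>_<table>.tsv, e.g.
    OmbuFlyers_DocumentVersion.tsv. Returns a dict that maps each KUK_ID
    to one dict containing the columns of all three tables.
    """
    merged = {}
    for table in TABLES:
        path = Path(metadata_dir) / f"{corpus}_{table}.tsv"
        for row in read_tsv(path):
            merged.setdefault(row["KUK_ID"], {}).update(row)
    return merged


def load_content_links(metadata_dir, corpus):
    """Map each KUK_ID to the list of KUK ids it refers to.

    A document may refer to several others, so ContentLinks is kept
    separate from the merged per-document records.
    """
    links = {}
    path = Path(metadata_dir) / f"{corpus}_ContentLinks.tsv"
    for row in read_tsv(path):
        links.setdefault(row["KUK_ID"], []).append(row["RefersTo"])
    return links
```

Under these assumptions, load_corpus_metadata("metadata", "OmbuFlyers") would return the merged records for the flyers, and load_content_links("metadata", "OmbuFlyers") the references between them. Only the Python standard library is used, so no extra packages need to be installed.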