Search Results
2. Addressed Arabic Phonetic Rules
- Creator:
- Mustafa, Ebtihal and Bouzoubaa, Karim
- Publisher:
- languages journal
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- phonetics and Arabic phonetic System.
- Language:
- Arabic
- Description:
- This XML file describes the Arabic phonetic constraints that are to be applied to Arabic roots. The first rule category lists the letters that may not occur in the same root, regardless of their order. The second category lists the letters that may not be used together in a root in a specific order. The third and fourth categories state that contiguous letters must not be redundant. ISLRN: 991-445-325-823-5
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
3. Arabic Phonetic Rules
- Creator:
- Mustafa, Ebtihal and Bouzoubaa, Karim
- Publisher:
- languages journal
- Type:
- text, other, and lexicalConceptualResource
- Subject:
- Arabic and phonetic rules
- Language:
- Arabic
- Description:
- This XML file describes the Arabic phonetic constraints (rules) resulting from the analysis of the lexicons (Taj Alarous, Al ain, Lisan Al arab, Alwassit and almoassir). These rules are to be applied to Arabic roots and are classified into a number of categories. Each category holds a certain type of constraint, as follows: The first category states that the root must not consist of three identical letters. The second category states that the root must not start with two repeating letters. The third category lists the letters that must not occur in the same root, regardless of their order. The fourth category lists the letters that may not be used together in a certain order in a root. ISLRN: 190-535-098-473-3
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
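The four rule categories described for this resource could be applied to a candidate root roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `check_root`, the representation of the rules as Python sets, and the Latin placeholder letters are all assumptions made for illustration, and the real letter lists must be taken from the XML file itself. It also assumes that "a certain order" means one letter appearing anywhere before the other.

```python
from itertools import combinations


def check_root(root, unordered_pairs, ordered_pairs):
    """Return the rule categories (1-4) that `root` violates.

    unordered_pairs: set of frozensets of two letters that must not
        co-occur in a root, regardless of order (category 3).
    ordered_pairs: set of (first, second) tuples that must not occur
        in that order in a root (category 4).
    """
    violations = []

    # Category 1: the root must not consist of three identical letters.
    if len(root) == 3 and len(set(root)) == 1:
        violations.append(1)

    # Category 2: the root must not start with two repeating letters.
    if len(root) >= 2 and root[0] == root[1]:
        violations.append(2)

    # All letter pairs (earlier, later), keeping their order in the root.
    pairs = list(combinations(root, 2))

    # Category 3: forbidden co-occurrence, order ignored.
    if any(frozenset(pair) in unordered_pairs for pair in pairs):
        violations.append(3)

    # Category 4: forbidden co-occurrence in a specific order.
    if any(pair in ordered_pairs for pair in pairs):
        violations.append(4)

    return violations


# Hypothetical placeholder rules, NOT taken from the resource.
UNORDERED = {frozenset("bf")}
ORDERED = {("d", "z")}
```

With these placeholder rules, `check_root("dkz", UNORDERED, ORDERED)` reports category 4, while the reversed root `"zkd"` passes, which is the difference between the order-sensitive and order-insensitive categories.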
4. C4Corpus (CC BY-NC part)
- Creator:
- Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
- Publisher:
- Technische Universität Darmstadt
- Type:
- text and corpus
- Subject:
- CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
- Language:
- Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Vietnamese, and Chinese
- Description:
- A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It has been extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
5. Czech Court Decisions Corpus (CzCDC 1.0)
- Creator:
- Novotná, Tereza and Harašta, Jakub
- Publisher:
- Faculty of Law, Masaryk University
- Type:
- text and corpus
- Subject:
- legal texts, judicial decisions, and court decisions
- Language:
- Czech
- Description:
- This is the Czech Court Decisions Corpus (CzCDC 1.0). The corpus contains the full texts of decisions from three top-tier courts (Supreme, Supreme Administrative and Constitutional Court) in the Czech Republic. The decisions were published between 1 January 1993 and 30 September 2018. The language of the decisions is Czech. The content of the decisions is unedited and was obtained directly from the competent court. Decisions are in .txt format in three folders divided by court. The corpus contains three .csv files listing all decisions, with four columns:
  - name of the file: exact file name of a decision with extension .txt;
  - decision identifier (docket number): official identification of the decision as issued by the court;
  - date of decision: in ISO 8601 (YYYY-MM-DD);
  - court abbreviation: SupCo for Supreme Court, SupAdmCo for Supreme Administrative Court, ConCo for Constitutional Court.
  Statistics:
  - SupCo: 111 977 decisions, 23 699 639 lines, 224 061 129 words, 1 462 948 200 bits;
  - SupAdmCo: 52 660 decisions, 18 069 993 lines, 137 839 985 words, 1 067 826 507 bits;
  - ConCo: 73 086 decisions, 6 178 371 lines, 98 623 753 words, 664 657 755 bits;
  - all courts combined: 237 723 decisions, 47 948 003 lines, 460 524 867 words, 3 195 432 462 bits.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
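The four-column .csv index described for this corpus could be read along these lines. This is a sketch under stated assumptions: the record does not say whether the files have a header row or which delimiter they use, so a comma delimiter is assumed and any row without a known court abbreviation is skipped. The names `Decision` and `read_index`, and the two sample rows, are invented for illustration and are not real corpus data.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date

# Court abbreviations listed in the corpus description.
COURTS = {"SupCo", "SupAdmCo", "ConCo"}


@dataclass
class Decision:
    file_name: str      # exact .txt file name of the decision
    docket_number: str  # official decision identifier
    decided_on: date    # parsed from ISO 8601 (YYYY-MM-DD)
    court: str          # one of COURTS


def read_index(handle):
    """Parse an index file into a list of Decision records."""
    decisions = []
    for row in csv.reader(handle):
        if len(row) != 4:
            continue  # ignore malformed or empty rows
        file_name, docket, iso_date, court = (cell.strip() for cell in row)
        if court not in COURTS:
            continue  # also skips a header row, if one exists
        decisions.append(
            Decision(file_name, docket, date.fromisoformat(iso_date), court)
        )
    return decisions


# Made-up sample rows standing in for a real index file.
SAMPLE = (
    "example_0001.txt,1 Xx 1/2000,2000-01-31,SupCo\n"
    "example_0002.txt,2 Yy 2/2010,2010-06-15,ConCo\n"
)
decisions = read_index(io.StringIO(SAMPLE))
```

For a real index, an open file handle would be passed in place of the `io.StringIO` wrapper; the path depends on how the archive is unpacked.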
6. Extended CLEF eHealth 2013-2015 IR Test Collection
- Creator:
- Pecina, Pavel and Saleh, Shadi
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- cross-lingual information retrieval and machine translation
- Language:
- English, Czech, French, German, Hungarian, Polish, Spanish, and Swedish
- Description:
- This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013-2015. Compared to the original version, it provides complete query translations into Czech, French, German, Hungarian, Polish, Spanish and Swedish, and additional relevance assessments.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
7. FAUST 0.5
- Creator:
- Hajič, Jan, Mareček, David, Fučíková, Eva, Cinková, Silvie, Štěpánek, Jan, and Mikulová, Marie
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- tectogrammatics, treebank, parallel corpus, and noisy texts
- Language:
- English and Czech
- Description:
- Syntactic (including deep-syntactic, i.e. tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on the Czech-English and English-Czech Faust Dev/Test sets. The English data includes manual annotations of English reference translations of Czech source texts. These texts were translated independently by two translators. After some necessary cleaning, 1000 segments were randomly selected for manual annotation. Both reference translations were annotated, which means 2000 annotated segments in total. The Czech data includes manual annotations of Czech reference translations of English source texts. These texts were translated independently by three translators. After some necessary cleaning, 1000 segments were randomly selected for manual annotation. All three reference translations were annotated, which means 3000 annotated segments in total. Faust is part of PDT-C 1.0 (http://hdl.handle.net/11234/1-3185).
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
8. Khresmoi Query Translation Test Data 2.0
- Creator:
- Pecina, Pavel, Dušek, Ondřej, Hajič, Jan, Libovický, Jindřich, and Urešová, Zdeňka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, test data, medical, health, machine translation, Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
- Language:
- Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
- Description:
- This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish and Swedish. The queries come from the general public and medical experts. This is version 2.0, which extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
9. Khresmoi Summary Translation Test Data 2.0
- Creator:
- Dušek, Ondřej, Hajič, Jan, Hlaváčová, Jaroslava, Libovický, Jindřich, Pecina, Pavel, Tamchyna, Aleš, and Urešová, Zdeňka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, test data, medical, health, machine translation, Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
- Language:
- Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
- Description:
- This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
10. Machine Translation Testsuite for Gender-Consistent Translation
- Creator:
- Aires, João Paulo
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, testsuite, evaluation, and gender
- Language:
- English and Czech
- Description:
- Document-level testsuite for evaluation of gender translation consistency. Our document-level test set consists of selected English documents from the WMT21 newstest annotated with gender information. Czech unannotated references are also added for convenience. We semi-automatically annotated person names and pronouns to identify the gender of these elements as well as coreferences. Our proposed annotation consists of three elements: (1) an ID, (2) an element class, and (3) gender. The ID identifies a person's name and its occurrences (name and pronouns). The element class identifies whether the tag refers to a name or a pronoun. Finally, the gender information defines whether the element is masculine or feminine. We applied a series of NLP techniques to automatically identify person names and coreferences. This initial process resulted in a set of 45 documents to be manually annotated. We then annotated these documents manually to make sure they are correctly tagged. See README.md for more details.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
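The three-element annotation described for this testsuite (ID, element class, gender) might be modelled as below. This is only an illustrative sketch: the actual tag syntax is documented in the package's README.md and is not reproduced here, and the names `Annotation` and `inconsistent_ids`, the field names, and the sample values are assumptions. The check simply flags any person ID whose occurrences carry more than one gender.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    person_id: int      # shared by a person's name and its pronouns
    element_class: str  # "name" or "pronoun"
    gender: str         # "masculine" or "feminine"


def inconsistent_ids(annotations):
    """Return the sorted person IDs annotated with more than one gender."""
    genders = defaultdict(set)
    for annotation in annotations:
        genders[annotation.person_id].add(annotation.gender)
    return sorted(pid for pid, seen in genders.items() if len(seen) > 1)


# Invented example: person 1 is consistent, person 2 is not.
EXAMPLE = [
    Annotation(1, "name", "feminine"),
    Annotation(1, "pronoun", "feminine"),
    Annotation(2, "name", "masculine"),
    Annotation(2, "pronoun", "feminine"),
]
```

The same grouping by ID could be applied to genders detected in a system's translation output, although how those are extracted from Czech text is outside the scope of this sketch.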