Search Results
2. Chared
- Creator:
- Pomikálek, Jan
- Publisher:
- Masaryk University, NLP Centre
- Type:
- toolService and tool
- Subject:
- character encoding, character encoding detection, charset, and unicode
- Language:
- English
- Description:
- Chared is a software tool that detects the character encoding of a text document, provided the language of the document is known. The language must be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). It also provides a training script for learning models for additional languages from a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than character encoding detection tools that have no language constraints. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages, consisting of 70 billion tokens altogether. Chared is open-source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this piece of software was published in POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011. pp. 125-129, 5 pp. ISBN 978-80-263-0077-9. and PRESEMT, Lexical Computing Ltd
- Rights:
- BSD 3-Clause "New" or "Revised" license, http://opensource.org/licenses/BSD-3-Clause, and PUB
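The Chared description says detection rests on the similarity of byte trigram vectors computed against a model for the known language. The sketch below illustrates that general idea only. It is not chared's code: the cosine measure, the function names, and the one-sentence toy "models" are all assumptions made for illustration, since the record does not say how chared stores its models or scores similarity.

```python
from collections import Counter
from math import sqrt


def trigram_vector(data: bytes) -> Counter:
    """Count every overlapping 3-byte sequence in the input."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (an assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def detect_encoding(data: bytes, models: dict[str, Counter]) -> str:
    """Return the encoding whose model vector is most similar to the document."""
    doc = trigram_vector(data)
    return max(models, key=lambda enc: cosine(doc, models[enc]))


# Toy stand-in for a language model: one Czech sentence encoded in each
# candidate encoding. A real model would be trained on many pages.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy."
models = {
    enc: trigram_vector(sample.encode(enc))
    for enc in ("utf-8", "iso-8859-2", "cp1250")
}

document = "Kůň úpěl ďábelské ódy příliš dlouho.".encode("cp1250")
print(detect_encoding(document, models))  # prints "cp1250"
```

Knowing the language is what makes this workable: ISO-8859-2 and Windows-1250 agree on most Czech letters, so only trigrams containing letters such as "š" separate the two candidates here.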
3. Copenhagen Dependency Treebanks versions 1-3
- Publisher:
- Copenhagen Business School
- Format:
- application/octet-stream
- Type:
- corpus
- Subject:
- parallel treebank, POS annotation, discourse annotation, morphological annotation, syntactic annotation, and semantic annotation
- Language:
- Danish, English, German, Italian, and Spanish
- Description:
- Parallel treebanks with annotation of syntax, discourse, coreference, morphology, and semantics. Version 3 also includes the Danish Dependency Treebank (version 1) and the Danish-English Parallel Dependency Treebank (version 2).
- Rights:
- GNU General Public License
4. Corpus of contemporary blogs
- Creator:
- Grác, Marek
- Publisher:
- Masaryk University, NLP Centre
- Type:
- text and corpus
- Subject:
- corpus, blogs, annotation, annotators, sentences, and machine learning
- Language:
- Czech
- Description:
- At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, myego.cz/welldone
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
5. Croatian-English Parallel Corpus
- Publisher:
- University of Zagreb, Faculty of Humanities and Social Sciences
- Type:
- corpus
- Language:
- Croatian and English
- Description:
- written; domain-specific (newspaper); synchronic; bilingual; parallel; unidirectional; XML; S-alignment
- Rights:
- Not specified
6. Dictionaries of Luxembourgish
- Publisher:
- University of Luxembourg
- Format:
- application/octet-stream
- Type:
- corpus
- Language:
- Luxembourgish
- Description:
- Online database of three older dictionaries of Luxembourgish from 1849, 1905, and 1950
- Rights:
- Not specified
7. Dictionary of the standard Latvian language
- Publisher:
- Institute of Mathematics and Computer Science, University of Latvia
- Format:
- text/html
- Type:
- lexicalConceptualResource
- Subject:
- dictionary
- Language:
- Latvian
- Description:
- ~64 000 entries
- Rights:
- Not specified
8. EDBL: Lexical Data Base for Basque (Euskararen Datu-base Lexikala)
- Publisher:
- University of the Basque Country
- Format:
- application/xml
- Type:
- lexicalConceptualResource
- Language:
- Basque
- Description:
- EDBL (Lexical DataBase for Basque) is the lexical basis needed for the automatic treatment of Basque. It is made up of about 120,000 entries divided into dictionary entries (the same ones found in a conventional dictionary), verb forms, and dependent morphemes, all with their respective morphological information.
- Rights:
- Only for research and demonstrative purposes
9. English-Urdu Religious Parallel Corpus
- Creator:
- Jawaid, Bushra and Zeman, Daniel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parallel corpus, religious text, and machine translation
- Language:
- English and Urdu
- Description:
- The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include, but are not limited to, the following: (1) Manually corrected sentence alignment of the corpora. (2) Our data split (training-development-test), so that our published experiments can be reproduced. (3) Tokenization (optional, but needed to reproduce our experiments). (4) Normalization (optional) of, e.g., European vs. Urdu numerals, European vs. Urdu punctuation, and removal of Urdu diacritics.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
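The optional normalization step listed for the English-Urdu corpus (numerals, punctuation, diacritics) could look roughly like the sketch below. This is not the authors' script. The direction of the mapping (Urdu forms to European ones), the particular punctuation marks covered, and the function name `normalize_urdu` are assumptions, since the record does not specify them.

```python
import unicodedata

# Extended Arabic-Indic digits U+06F0..U+06F9, mapped to ASCII 0-9.
DIGITS = {0x06F0 + i: str(i) for i in range(10)}

# A few Arabic-script punctuation marks and their European counterparts:
# full stop, comma, semicolon, question mark.
PUNCTUATION = {0x06D4: ".", 0x060C: ",", 0x061B: ";", 0x061F: "?"}

TABLE = {**DIGITS, **PUNCTUATION}


def normalize_urdu(text: str) -> str:
    """Map Urdu digits and punctuation to European forms and drop diacritics."""
    text = text.translate(TABLE)
    # Diacritics are nonspacing combining marks (Unicode category Mn).
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Digits one-two-three followed by the Urdu full stop.
print(normalize_urdu("\u06f1\u06f2\u06f3\u06d4"))  # prints "123."
```

Dropping every character in category Mn is a blunt choice: it removes the short-vowel marks, but it would also strip combining marks from any other script mixed into the text.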
10. Italian Content Words
- Creator:
- Grella, Matteo
- Publisher:
- Matteo Grella
- Type:
- text, machineReadableDictionary, and lexicalConceptualResource
- Subject:
- morphological dictionary
- Language:
- Italian
- Description:
- This resource is an Italian morphological dictionary for content words, encoded as a text file in JSON Lines format. It contains correspondences between surface forms and lexical forms of words, followed by grammatical features. The surface word forms have been generated algorithmically using stable phonological and morphological rules of the Italian language. Particular attention has been given to the generation of verbs, for which rules have been extracted from the well-known A. L. and G. Lepschy, La lingua italiana. With its broad coverage, the dictionary is particularly useful when used together with the Italian Function Words (http://hdl.handle.net/11372/LRT-2288) for tasks such as POS tagging or syntactic parsing.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
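Because the Italian Content Words dictionary is distributed as JSON Lines (one JSON object per line), it can be read with the standard library alone. The sketch below builds a lookup from surface form to analyses. The field names `form`, `lemma`, and `features`, and the two sample lines, are invented for illustration and are not taken from the resource; check the actual file for its real keys.

```python
import io
import json
from collections import defaultdict


def load_dictionary(lines):
    """Index JSON Lines entries by surface form (hypothetical key "form")."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        index[entry["form"]].append(entry)
    return index


# Invented sample in the assumed shape; a real run would pass an open file.
SAMPLE = io.StringIO(
    '{"form": "case", "lemma": "casa", "features": {"pos": "NOUN", "number": "plur"}}\n'
    '{"form": "casa", "lemma": "casa", "features": {"pos": "NOUN", "number": "sing"}}\n'
)

index = load_dictionary(SAMPLE)
print([entry["lemma"] for entry in index["case"]])  # prints ['casa']
```

Keeping a list per surface form leaves room for ambiguous forms that map to more than one lexical form.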