Search Results
12. Arabic Proclitics Lexicon
- Creator:
- Loukili, Taoufik
- Publisher:
- Ibtikarat team
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- proclitics
- Language:
- Arabic
- Description:
- An XML-based file containing all Arabic proclitics
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
13. Arabic Special verbs Lexicon
- Creator:
- Namly, Driss
- Publisher:
- Ibtikarat team
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- particles
- Language:
- Arabic
- Description:
- An XML-based file containing Arabic stop words that follow verb syntax
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
14. Bengali Visual Genome 1.0
- Creator:
- Sen, Arghyadeep, Parida, Shantipriya, Kotwal, Ketan, Panda, Subhadarshi, Bojar, Ondřej, and Dash, Satya Ranjan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- multi-modal, neural machine translation, image captioning, Bengali captioning, and English-Bengali Multimodal Corpus
- Language:
- English and Bengali
- Description:
- Data
  ----
  Bengali Visual Genome (BVG for short) 1.0 has goals similar to those of Hindi Visual Genome (HVG) 1.1: to support the Bengali language. Bengali Visual Genome 1.0 is a multimodal dataset in Bengali for machine translation and image captioning. It consists of text and images suitable for English-to-Bengali multimodal machine translation tasks and multimodal research. We follow the same selection of short English segments (captions) and associated images from Visual Genome as HVG 1.1. For BVG, we manually translated these captions from English to Bengali, taking the associated images into account. The manual translation was performed by native Bengali speakers without referring to any machine translation system.

  The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling as the original Hindi Visual Genome.

  A third test set is called the "challenge test set" and consists of 1.4K segments. The challenge test set was created for the WAT2019 multi-modal task by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity. The surrounding words in the sentence, however, often also include sufficient cues to identify the correct meaning of the ambiguous word.

  Dataset Formats
  ---------------
  The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files. All the text files have seven columns, as follows:

  Column1 - image_id
  Column2 - X
  Column3 - Y
  Column4 - Width
  Column5 - Height
  Column6 - English Text
  Column7 - Bengali Text

  The image part contains the full images, with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.

  Data Statistics
  ---------------
  The statistics of the current release are given below.

  Parallel Corpus Statistics
  --------------------------
  Dataset         Segments  English Words  Bengali Words
  --------------  --------  -------------  -------------
  Train              28930         143115         113978
  Dev                  998           4922           3936
  Test                1595           7853           6408
  Challenge Test      1400           8186           6657
  --------------  --------  -------------  -------------
  Total              32923         164076         130979

  The word counts are approximate, prior to tokenization.

  Citation
  --------
  If you use this corpus, please cite the following paper:

  @inproceedings{hindi-visual-genome:2022,
    title= "{Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning}",
    author={Sen, Arghyadeep and Parida, Shantipriya and Kotwal, Ketan and Panda, Subhadarshi and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
    editor={Satapathy, Suresh Chandra and Peer, Peter and Tang, Jinshan and Bhateja, Vikrant and Ghosh, Anumoy},
    booktitle= {Intelligent Data Engineering and Analytics},
    publisher= {Springer Nature Singapore},
    address= {Singapore},
    pages = {63--70},
    isbn = {978-981-16-6624-7},
    doi = {10.1007/978-981-16-6624-7_7},
  }
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
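The seven-column tab-delimited layout described for Bengali Visual Genome can be read with a few lines of Python. The following is only a minimal sketch: the `Segment` type, the `read_segments` helper, and the sample row are hypothetical and not part of the release, and the sketch assumes the files carry no header row and that the region columns hold integers.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Segment(NamedTuple):
    """One caption row: image id, region rectangle, and both texts."""
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    bengali: str


def read_segments(handle) -> Iterator[Segment]:
    """Yield one Segment per well-formed seven-column line."""
    # QUOTE_NONE keeps any quotation marks inside captions untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 7:
            continue  # skip blank or malformed lines
        image_id, x, y, width, height, english, bengali = row
        yield Segment(image_id, int(x), int(y), int(width), int(height),
                      english, bengali)


# Made-up sample row for illustration only (not taken from the corpus).
sample = "101\t10\t20\t300\t150\ta red bus\tএকটি লাল বাস\n"
segments = list(read_segments(io.StringIO(sample)))
first = segments[0]
# Corners of the rectangular region described by the caption.
box = (first.x, first.y, first.x + first.width, first.y + first.height)
```

With a real file, the handle would typically be opened with `encoding="utf-8"` and `newline=""`, and `image_id` would be used to locate the matching image file.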
15. CERED baseline models
- Creator:
- Šimečková, Zuzana and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- mlmodel, text, and languageDescription
- Subject:
- relationship extraction
- Language:
- Czech
- Description:
- Relationship extraction models for the Czech language. The models are trained on CERED (a dataset created by distant supervision on Czech Wikipedia and Wikidata) and recognize a subset of Wikidata relations (listed in CEREDx.LABELS). We supply a demo.py script that performs inference on user-defined input, and a requirements.txt file for pip. Adapt the demo code to use the model. Both the dataset and the models are presented in the Relationship Extraction thesis.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/
16. Chared
- Creator:
- Pomikálek, Jan
- Publisher:
- Masaryk University, NLP Centre
- Type:
- toolService and tool
- Subject:
- character encoding, character encoding detection, charset, and unicode
- Language:
- English
- Description:
- Chared is a software tool that detects the character encoding of a text document, provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). It also provides a training script for learning models for additional languages from a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than character encoding detection tools that place no constraint on the language. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages, consisting of 70 billion tokens altogether. Chared is open-source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this piece of software was published in: POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011, pp. 125-129 (5 pp.). ISBN 978-80-263-0077-9. and PRESEMT, Lexical Computing Ltd
- Rights:
- BSD 3-Clause "New" or "Revised" license, http://opensource.org/licenses/BSD-3-Clause, and PUB
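The Chared entry says detection is based on the similarity of byte trigram vectors. The sketch below illustrates that general idea only; it is not chared's code or API. The function names, the use of cosine similarity, and the approach of building each "model" by encoding one reference text are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def byte_trigrams(data: bytes) -> Counter:
    """Count every overlapping 3-byte sequence in the data."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_encoding(document: bytes, reference_text: str,
                   candidates: Sequence[str]) -> str:
    """Return the candidate whose trigram vector is closest to the document.

    Each candidate "model" is built by encoding the same known-language
    reference text in that encoding (an assumption of this sketch).
    """
    doc_vector = byte_trigrams(document)
    scores = {}
    for encoding in candidates:
        try:
            model = byte_trigrams(reference_text.encode(encoding))
        except UnicodeEncodeError:
            continue  # the encoding cannot represent this language sample
        scores[encoding] = cosine(doc_vector, model)
    return max(scores, key=scores.get)


# Czech reference text and a document of unknown encoding (here cp1250).
reference = "Příliš žluťoučký kůň úpěl ďábelské ódy"
unknown = "žluťoučký kůň".encode("cp1250")
guess = guess_encoding(unknown, reference, ["utf-8", "iso8859_2", "cp1250"])
```

Knowing the language is what makes this workable: the reference text supplies the trigrams that are typical of that language in each candidate encoding, so even closely related encodings such as cp1250 and ISO 8859-2 can be told apart by the few bytes on which they differ.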
17. CoNLL-based Extended Czech Named Entity Corpus 1.0
- Creator:
- Konkol, Michal, Konopík, Miloslav, Ševčíková, Magda, Žabokrtský, Zdeněk, and Straková, Jana
- Publisher:
- University of West Bohemia
- Type:
- text and corpus
- Subject:
- named entity recognition, Czech, and conll
- Language:
- Czech
- Description:
- This is the Czech Named Entity Corpus 1.0 transformed into the CoNLL format. The original corpus can be downloaded from: http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C. The CoNLL transformation is described in this publication: https://link.springer.com/chapter/10.1007/978-3-642-40585-3_20.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
18. CoNLL-based Extended Czech Named Entity Corpus 2.0
- Creator:
- Konkol, Michal, Konopík, Miloslav, Ševčíková, Magda, Žabokrtský, Zdeněk, Straková, Jana, and Straka, Milan
- Publisher:
- University of West Bohemia
- Type:
- text and corpus
- Subject:
- named entity recognition and Czech
- Language:
- Czech
- Description:
- This is the Czech Named Entity Corpus 2.0 transformed into the CoNLL format. The original corpus can be downloaded from: http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8. The CoNLL transformation is described in this publication: https://link.springer.com/chapter/10.1007/978-3-642-40585-3_20.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
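Both CoNLL-based corpora above can in principle be read with a generic CoNLL-style loader. The sketch below assumes the common convention of one token per line, whitespace-separated columns with the token first and the entity tag last, and a blank line between sentences. The exact column layout of these two corpora is not stated in the entries, so check it against the files. The `read_conll` helper and the sample tags are hypothetical.

```python
import io
from typing import Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(handle) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag) pairs."""
    sentence: Sentence = []
    for line in handle:
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumed layout: first column is the token, last is the tag.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence  # the file may not end with a blank line


# Made-up sample with illustrative tags (not taken from the corpus).
sample = "Karel B-PER\nČapek I-PER\npsal O\n\nPraha B-LOC\n"
sentences = list(read_conll(io.StringIO(sample)))
```

If the files contain extra columns (for example lemmas or part-of-speech tags) between the token and the tag, the loader still works as long as the token stays first and the tag stays last.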
19. Contemporary Arabic dictionary
- Creator:
- Namly, Driss
- Publisher:
- Ibtikarat Team
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- lexical semantics
- Language:
- Arabic
- Description:
- An XML-based file containing the electronic version of the al logha al arabia al moassira (Contemporary Arabic) dictionary, a monolingual Arabic dictionary compiled by Ahmed Mukhtar Abdul Hamid Omar (deceased: 1424) with the help of a working group.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
20. Corpus of contemporary blogs
- Creator:
- Grác, Marek
- Publisher:
- Masaryk University, NLP Centre
- Type:
- text and corpus
- Subject:
- corpus, blogs, annotation, annotators, sentences, and machine learning
- Language:
- Czech
- Description:
- At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, and myego.cz/welldone.
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB