Number of results to display per page
Search Results
182. LAC English Corpus
- Publisher:
- Max Planck Institute for Psycholinguistics
- Type:
- corpus
- Language:
- English
- Description:
- Language and Cognition corpus
- Rights:
- Not specified
183. Large-Scale Colloquial Persian 0.5
- Creator:
- Abdi Khojasteh, Hadi, Ansari, Ebrahim, and Bohlouli, Mahdi
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and Institute for Advanced Studies in Basic Sciences (IASBS)
- Type:
- text and corpus
- Subject:
- PoS tagging, corpus, annotated corpus, multilingual, derivation, dependency parser, machine translation, informal language, spoken language, monolingual corpus, and bilingual corpus annotation
- Language:
- Persian, English, German, Czech, Italian, and Hindi
- Description:
- "Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in asemantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with its dependency relations in syntactic annotation, Part-of-speech tags, sentiment polarity and automatic translation of original Persian sentences in five different languages (EN, CS, DE, IT, HI).
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
184. LCsum (Document Summarizer)
- Publisher:
- Centro de Tecnologías y Aplicaciones del Lenguaje y del Habla (TALP)
- Type:
- toolService
- Language:
- Catalan, English, and Spanish
- Description:
- Document summarizer.
- Rights:
- Not specified
185. Lingua::Interset 2.026
- Creator:
- Zeman, Daniel
- Publisher:
- Charles University, Faculty of Mathematics and Physics
- Type:
- tool and toolService
- Subject:
- morphology, part of speech, conversion, and tagset
- Language:
- Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Modern Greek (1453-), English, Spanish, Estonian, Basque, Persian, Finnish, Ancient Greek (to 1453), Hebrew, Hindi, Croatian, Japanese, Multiple languages, and Portuguese
- Description:
- Lingua::Interset is a universal morphosyntactic feature set to which all tagsets of all corpora/languages can be mapped. Version 2.026 covers 37 different tagsets of 21 languages. Limited support of the older drivers for other languages (which are not included in this package but are available for download elsewhere) is also available; these will be fully ported to Interset 2 in future. Interset is implemented as Perl libraries. It is also available via CPAN.
- Rights:
- Artistic License (Perl) 1.0, http://opensource.org/licenses/Artistic-Perl-1.0, and PUB
186. LiStr: Linguistic Structure Induction Tookit
- Creator:
- Mareček, David and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- parsing, unsupervised machine learning, machine translation, and grammar induction
- Language:
- English
- Description:
- This toolkit comprises the tools and supporting scripts for unsupervised induction of dependency trees from raw texts or texts with already assigned part-of-speech tags. There are also scripts for simple machine translation based on unsupervised parsing and scripts for minimally supervised parsing into Universal-Dependencies style.
- Rights:
- GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB
187. LongEval Click-Model Relevance Judgements (Qrels)
- Creator:
- Galuščáková, Petra, Devaud, Romain, Gonzalez-Saez, Gabriela, Mulhem, Philippe, Goeuriot, Lorraine, Piroi, Florina, and Popel, Martin
- Publisher:
- Université Grenoble Alpes
- Type:
- text and corpus
- Subject:
- information retrieval, automatic evaluation, and evaluation
- Language:
- French and English
- Description:
- The collection comprises the relevance judgments used in the 2023 LongEval Information Retrieval Lab (https://clef-longeval.github.io/), organized at CLEF. It consists of three sets of relevance judgments: 1) Relevance judgments for the heldout queries from the LongEval Train Collection (http://hdl.handle.net/11234/1-5010). 2) Relevance judgments for the short-term persistence (sub-task A) queries from the LongEval Test Collection (http://hdl.handle.net/11234/1-5139). 3) Relevance judgments for the long-term persistence (sub-task B) queries from the LongEval Test Collection (http://hdl.handle.net/11234/1-5139). These judgments were provided by the Qwant search engine (https://www.qwant.com) and were generated using a click model. The click model output was based on the clicks of Qwant's users, but it mitigates noise from raw user clicks caused by positional bias and also better safeguards users' privacy. Consequently, it can serve as a reliable soft relevance estimate for evaluating and training models. The collection includes a total of 1,420 judgments for the heldout queries, with 74 considered highly relevant and 326 deemed relevant. For the short-term sub-task queries, there are 12,217 judgments, including 762 highly relevant and 2,608 relevant ones. As for the long-term sub-task queries, there are 13,467 judgments, with 936 being highly relevant and 2,899 relevant.
- Rights:
- Qwant LongEval Attribution-NonCommercial-ShareAlike License, https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License, and PUB
188. LongEval Test Collection
- Creator:
- Galuščáková, Petra, Devaud, Romain, Gonzalez-Saez, Gabriela, Mulhem, Philippe, Goeuriot, Lorraine, Piroi, Florina, and Popel, Martin
- Publisher:
- Université Grenoble Alpes, Qwant, Research Studios Austria, and Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- information retrieval, cross-language, cross-lingual information retrieval, parallel corpus, and search
- Language:
- French and English
- Description:
- The collection consists of queries and documents provided by the Qwant search Engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on the selected trending topics. The documents in the collection are the webpages which were selected with respect to these queries using the Qwant click model. Apart from the documents selected using this model, the collection also contains randomly selected documents from the Qwant index. The collection serves as the official test collection for the 2023 LongEval Information Retrieval Lab (https://clef-longeval.github.io/) organised at CLEF. The collection contains test datasets for two organized sub-tasks: short-term persistence (sub-task A) and long-term persistence (sub-task B). The data for the short-term persistence sub-task was collected over July 2022 and this dataset contains 1,593,376 documents and 882 queries. The data for the long-term persistence sub-task was collected over September 2022 and this dataset consists of 1,081,334 documents and 923 queries. Apart from the original French versions of the webpages and queries, the collection also contains their translations into English.
- Rights:
- Qwant LongEval Attribution-NonCommercial-ShareAlike License, https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License, and PUB
189. LongEval Train Collection
- Creator:
- Galuščáková, Petra, Devaud, Romain, Gonzalez-Saez, Gabriela, Mulhem, Philippe, Goeuriot, Lorraine, Piroi, Florina, and Popel, Martin
- Publisher:
- Université Grenoble Alpes, Qwant, Research Studios Austria, and Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- information retrieval, parallel corpus, search, and automatic evaluation
- Language:
- French and English
- Description:
- The collection consists of queries and documents provided by the Qwant search Engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on the selected trending topics. The documents in the collection were selected with respect to these queries using the Qwant click model. Apart from the documents selected using this model, the collection also contains randomly selected documents from the Qwant index. All the data were collected over June 2022. In total, the collection contains 672 train queries, with corresponding 9656 assessments coming from the Qwant click model, and 98 heldout queries. The set of documents consist of 1,570,734 downloaded, cleaned and filtered Web Pages. Apart from their original French versions, the collection also contains translations of the webpages and queries into English. The collection serves as the official training collection for the 2023 LongEval Information Retrieval Lab (https://clef-longeval.github.io/) organised at CLEF.
- Rights:
- Qwant LongEval Attribution-NonCommercial-ShareAlike License, PUB, and https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License
190. Machine Translation Testsuite for Gender-Consistent Translation
- Creator:
- Aires, João Paulo
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, testsuite, evaluation, and gender
- Language:
- English and Czech
- Description:
- Document-level testsuite for evaluation of gender translation consistency. Our Document-Level test set consists of selected English documents from the WMT21 newstest annotated with gender information. Czech unnanotated references are also added for convenience. We semi-automatically annotated person names and pronouns to identify the gender of these elements as well as coreferences. Our proposed annotation consists of three elements: (1) an ID, (2) an element class, and (3) gender. The ID identifies a person's name and its occurrences (name and pronouns). The element class identifies whether the tag refers to a name or a pronoun. Finally, the gender information defines whether the element is masculine or feminine. We performed a series of NLP techniques to automatically identify person names and coreferences. This initial process resulted in a set containing 45 documents to be manually annotated. Thus, we started a manual annotation of these documents to make sure they are correctly tagged. See README.md for more details.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB