NameTag 2 is a named entity recognition tool. It recognizes named entities (e.g., personal names and locations), both flat and embedded (nested). NameTag 2 can be used either as a command-line tool or through the NameTag web service.
The NameTag web service can be found at:
https://lindat.mff.cuni.cz/services/nametag/
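As a rough illustration of calling the web service from code, the sketch below builds a request URL and reads the JSON response. The `api/recognize` path, the `data` and `output` parameter names, and the `result` response field are assumptions not stated in this description; check them against the service documentation before relying on them. The helper names `build_request_url` and `recognize` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the "api/recognize" path and the parameter names below
# are not given in this description; verify them in the service docs.
API_URL = "https://lindat.mff.cuni.cz/services/nametag/api/recognize"


def build_request_url(text: str, output: str = "vertical") -> str:
    """Return a GET URL asking the service to recognize entities in `text`."""
    query = urllib.parse.urlencode({"data": text, "output": output})
    return f"{API_URL}?{query}"


def recognize(text: str) -> str:
    """Send the request and return the assumed 'result' field of the reply."""
    with urllib.request.urlopen(build_request_url(text)) as response:
        return json.load(response)["result"]
```

Only `recognize` touches the network; `build_request_url` can be inspected offline to see exactly what would be sent.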
The NameTag command-line tool can be downloaded from the NameTag GitHub repository, branch nametag2:
git clone https://github.com/ufal/nametag -b nametag2
The latest models and documentation can be found at:
https://ufal.mff.cuni.cz/nametag/2
This software is subject to the terms of the Mozilla Public License, v. 2.0 (http://mozilla.org/MPL/2.0/). The associated models are distributed under the CC BY-NC-SA license.
Please cite as:
Jana Straková, Milan Straka, Jan Hajič (2019): Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326-5331, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2 (https://aclweb.org/anthology/papers/P/P19/P19-1527/)
NER models for NameTag 2, a named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation, including performance, can be found at https://ufal.mff.cuni.cz/nametag/2/models . NameTag 2 itself can be found at https://ufal.mff.cuni.cz/nametag/2 .
Named entity annotations added to the existing Parallel Global Voices resource, ces-eng language pair. The annotations distinguish four classes: Person, Organization, Location, and Misc. They follow the IOB schema (one tag per token, marking the beginning and the inside of multi-word entities). The named entity linking (NEL) annotation contains Wikidata Qnames.
SumeCzech-NER
SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).
Format
The dataset is split into four files in jsonl format, with one JSON object on each line. The most important fields of each JSON object are:
- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of article's abstract
- ne_headline: list of named entity annotations of article's headline
- ne_text: list of named entity annotations of article's text
- url: article's URL that can be used to match article across SumeCzech and SumeCzech-NER
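A minimal sketch of reading this format is shown below: it parses one JSON object per line and indexes the records by `url`, which is the field meant for matching against SumeCzech. The helper names `read_records` and `index_by_url` are hypothetical, and the sample record is made up for illustration, including its tag labels, which may differ from those in the real files.

```python
import io
import json
from typing import Dict, Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def index_by_url(lines: Iterable[str]) -> Dict[str, Dict]:
    """Map each article URL to its record, for joining with SumeCzech."""
    return {record["url"]: record for record in read_records(lines)}


# Made-up sample standing in for an opened jsonl file.
sample = io.StringIO(
    '{"dataset": "dev", "url": "https://example.com/a", '
    '"ne_headline": ["B-P", "I-P", "O"], "ne_abstract": [], "ne_text": []}\n'
)
records = index_by_url(sample)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8")`.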
Annotations
We used spaCy's NER model trained on the CoNLL-based extended CNEC 2.0. The model achieved an F-score of 78.45 on the dataset's test set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.
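To make the IOB2 format concrete, the sketch below groups a tag sequence into entity spans. In IOB2 every entity starts with a `B-` tag, continues with `I-` tags of the same type, and `O` marks tokens outside any entity. The function name `iob2_to_spans` and the type labels in the usage example are hypothetical; the labels actually used in the dataset may differ.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) triples, with `end` exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    # The trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        # An "I-" tag with no open entity is ignored in this sketch.
    return spans


spans = iob2_to_spans(["B-P", "I-P", "O", "B-G"])
```

Here `spans` covers tokens 0 to 1 as one entity of type `P` and token 3 as one entity of type `G`.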
Tokenization
We used the following Python code for tokenization:
from typing import List

from nltk.tokenize import word_tokenize


def tokenize(text: str) -> List[str]:
    for mark in ('.', ',', '?', '!', '-', '–', '/'):
        text = text.replace(mark, f' {mark} ')
    tokens = word_tokenize(text)
    return tokens
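The effect of the padding loop can be seen in isolation with the sketch below. It repeats the same replacement step but then splits on whitespace, which is only a stand-in for `word_tokenize` and can produce different tokens on real text; the function name `pad_punctuation` is hypothetical.

```python
from typing import List

MARKS = ('.', ',', '?', '!', '-', '–', '/')


def pad_punctuation(text: str) -> str:
    """Surround each listed mark with spaces, as the loop above does."""
    for mark in MARKS:
        text = text.replace(mark, f' {mark} ')
    return text


def rough_tokens(text: str) -> List[str]:
    """Whitespace split after padding; an approximation, not word_tokenize."""
    return pad_punctuation(text).split()


tokens = rough_tokens("Praha-Brno, 5/2019.")
```

The hyphen, comma, slash, and period each end up as separate tokens, so token-level IOB2 tags line up with punctuation as well as words.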