Parallel treebanks with annotation of syntax, discourse, coreference, morphology, and semantics. Version 3 also includes the Danish Dependency Treebank (version 1) and the Danish-English Parallel Dependency Treebank (version 2).
DTAG is a versatile annotation tool that supports manual and semi-automatic annotation of a wide range of linguistic phenomena, including the annotation of syntax, discourse, coreference, morphology, and word and phrase alignments. It includes commands for editing general labeled graphs and graph alignments, comparing annotations, managing annotation tasks, and interfacing with a revision control system. Its visualization component can display graphs and alignments for entire texts in a compact format, with a highly flexible and configurable formatting scheme. It also provides a powerful search-replace mechanism with queries based on full first-order logic, which can be used to search for linguistic constructions and automatically apply graph transformations to collections of annotated graphs. The visualization component does not currently support characters outside the ISO-latin character set.
1) Finds repeated sequences of words in documents (repetitiveness checker) 2) Finds common sequences of words in several documents (version comparison) A sequence of words consists of minimally two words. There is no upper limit of the number of words in a sequence, but sequences do not transgress sentence delimiters. There are several weight functions to choose from, each defining "good" sequences in a different way, based on word frequency, sequence lenght and number of repetitions.