Victor is a web page cleaning tool. It is aimed at removing menu, ads, footers, headers, etc. from HTML web pages, so that only main web page content remains. Victor is based on a conditional random fields algorithm.
Victoria is an on-line HTML web page annotation tool suitable for selecting texts on the web pages. It can be used to mark important/interesting parts of web pages for further processing.
A tool used to build multilingual corpora from wikipedia. Download the web pages, convert them to plain text, identify language, etc.
A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
Marian NMT model for Catalan to Occitan translation. It is a multi-task model, producing also a phonemic transcription of the Catalan source. The model was submitted to WMT'21 Shared Task on Multilingual Low-Resource Translation for Indo-European Languages as a CUNI-Contrastive system for Catalan to Occitan.
Marian NMT model for Catalan to Occitan translation. Primary CUNI submission for WMT21 Multilingual
Low-Resource Translation for Indo-European Languages Shared Task.
Marian multilingual translation model from Catalan into Romanian, Italian and Occitan. Primary CUNI submission for WMT21 Multilingual
Low-Resource Translation for Indo-European Languages Shared Task.