dc.contributor.author | Estève, Louis Clément |
dc.contributor.author | Savary, Agata |
dc.contributor.author | Lavergne, Thomas |
dc.date.accessioned | 2024-07-12T11:53:50Z |
dc.date.available | 2024-07-12T11:53:50Z |
dc.date.issued | 2024-06-07 |
dc.identifier.uri | http://hdl.handle.net/11234/1-5528 |
dc.description | This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs), one per language (German, Greek, Basque, French, Irish, Hebrew, Hindi, Italian, Polish, Brazilian Portuguese, Romanian, Swedish, Turkish, Chinese). The vectors were trained with the skip-gram version of the Word2Vec algorithm on the PARSEME raw corpora, which are automatically annotated for morpho-syntax (http://hdl.handle.net/11234/1-3367). These corpora were annotated for VMWEs by Seen2Seen, a rule-based VMWE identifier and one of the leading tools of the PARSEME shared task version 1.2, and the tokens of each identified VMWE were merged into a single token. The vector space files use the binary format of the original Word2Vec implementation by Mikolov et al. (2013). The files are distributed compressed with xz (.bin.xz). |
dc.language.iso | deu |
dc.language.iso | ell |
dc.language.iso | eus |
dc.language.iso | fra |
dc.language.iso | gle |
dc.language.iso | heb |
dc.language.iso | hin |
dc.language.iso | ita |
dc.language.iso | pol |
dc.language.iso | por |
dc.language.iso | ron |
dc.language.iso | swe |
dc.language.iso | tur |
dc.language.iso | zho |
dc.publisher | Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique |
dc.rights | PARSEME Shared Task Raw Corpus Data (v. 1.2) Agreement |
dc.rights.uri | https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.2-raw |
dc.source.uri | https://gitlab.com/parseme/corpora |
dc.subject | verbal multiword expressions |
dc.subject | word embeddings |
dc.subject | word2vec |
dc.title | Multilingual static embeddings for Verbal Multiword Expressions trained on PARSEME raw corpora |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.mediaType | text |
metashare.ResourceInfo#ContentInfo.detailedType | computationalLexicon |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Louis Estève louis.esteve@universite-paris-saclay.fr Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique |
sponsor | Université Paris Saclay Plan blanc PhD grant nationalFunds |
size.info | 44412316 entries |
size.info | 17267 multiWordUnits |
files.size | 16752281832 |
files.count | 22 |
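The description above says the vector files use the binary format of the original Word2Vec tool and are distributed compressed. The following is a minimal Python sketch of reading such a file with the standard library only. It is not the bundled load_vectors.py, whose contents are not shown in this record. The function names `read_word2vec_binary` and `load_xz` are hypothetical, and the layout (ASCII header line, then token, space, and little-endian float32 values per entry) is an assumption based on the original Word2Vec tool.

```python
import io
import lzma
import struct


def read_word2vec_binary(stream):
    """Read word2vec binary vectors from a binary file-like object.

    Assumed layout: an ASCII header line "<vocab_size> <dim>", then for
    each entry the token bytes, one space, and <dim> little-endian float32
    values, optionally followed by a newline byte.
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        token = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended inside a token")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline trailing the previous vector
                token.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("file ended inside a vector")
        word = token.decode("utf-8", errors="replace")
        vectors[word] = struct.unpack(f"<{dim}f", raw)
    return vectors


def load_xz(path):
    """Decompress an .xz file on the fly and parse it (hypothetical helper)."""
    with lzma.open(path, "rb") as stream:
        return read_word2vec_binary(stream)
```

A call such as `load_xz("MWE_S2S_GA_typed_100d_skip-gram.bin.xz")` would return a dictionary mapping each token to a tuple of floats. This sketch holds the whole vocabulary in plain Python objects and reads tokens byte by byte, so for the multi-gigabyte files listed here a dedicated library reader (for example gensim's `KeyedVectors.load_word2vec_format(path, binary=True)` on the decompressed file) is likely more practical.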
Files in this item
License category:
Licence: PARSEME Shared Task Raw Corpus Data (v. 1.2) Agreement
Publicly Available
Name | Size | Format | MD5 |
INSTALL.md | 1.21 KB | Unknown | 23fbf46cd30ccdae44893d3906946e9f |
MWE_S2S_DE_typed_100d_skip-gram.bin.xz | 822.69 MB | application/x-xz | c5968c97c89a332d08d4ec100cc794b1 |
MWE_S2S_EL_typed_100d_skip-gram.bin.xz | 472.67 MB | application/x-xz | 1408874115eb749721791284e4c0ee1e |
MWE_S2S_EU_typed_100d_skip-gram.bin.xz | 145.61 MB | application/x-xz | 0d2536382ddda7a92c68d7e6ffde23d4 |
MWE_S2S_FR_typed_100d_skip-gram.bin.xz | 1.96 GB | application/x-xz | e4ffbd8874d2ca4593036884c4f7fb0b |
MWE_S2S_GA_typed_100d_skip-gram.bin.xz | 197.87 MB | application/x-xz | 12804a1e814d9fd4f03608f821c73f08 |
MWE_S2S_HE_typed_100d_skip-gram.bin.xz | 117.52 MB | application/x-xz | c382b1f98e4d18c1652ef65d14f0a06b |
MWE_S2S_HI_typed_100d_skip-gram.bin.xz | 319.86 MB | application/x-xz | b4fb6af09acf2fee1fa4c426f9b60b53 |
MWE_S2S_IT_typed_100d_skip-gram.bin.xz | 611.89 MB | application/x-xz | 07a883f11b4442f2b0611e1fa29101f4 |
MWE_S2S_PL_typed_100d_skip-gram.bin.xz | 3.89 GB | application/x-xz | e2c4349c12f2da5fe4c7a918f67c0ec9 |
MWE_S2S_PT_typed_100d_skip-gram.bin.xz | 1.51 GB | application/x-xz | a40e2effa510c076c46e0eebe8c31bef |
MWE_S2S_RO_typed_100d_skip-gram.bin.xz | 100.33 MB | application/x-xz | 335a84cd858ad1854696cbb7cc41dd90 |
MWE_S2S_SV_typed_100d_skip-gram.bin.xz | 4.62 GB | application/x-xz | 1c80c77c6eb64d9842fb3cae87d8dce8 |
MWE_S2S_TR_typed_100d_skip-gram.bin.xz | 237.35 MB | application/x-xz | 4f193bcbc0c4d8dee5cbac6b753afe93 |
MWE_S2S_ZH_typed_100d_skip-gram.bin.xz | 679.84 MB | application/x-xz | c60eee0a4faf6ed54705e6d080335881 |
load_vectors.py | 1.29 KB | Unknown | 83c3839d234ba066126f9fb77bd22c71 |
load_vectors.sh | 153 bytes | Unknown | 8219c98d3d1999d86660a9d8459395ca |
md5_checksums.txt | 1022 bytes | Text file | 0a2210bab3bd4160578317b8f9bd443a |
sha3_checksums.txt | 1.33 KB | Text file | 62c052404a4361b91df2464f21971885 |
unzip.sh | 149 bytes | Unknown | 85764c8a353ad3462b948e407cbdff5a |
verify_checksums.sh | 351 bytes | Unknown | 570a61090da4e672e2d7309e0ab7086a |
Preview of md5_checksums.txt (truncated):
c5968c97c89a332d08d4ec100cc794b1 *MWE_S2S_DE_typed_100d_skip-gram.bin.xz
1408874115eb749721791284e4c0ee1e *MWE_S2S_EL_typed_100d_skip-gram.bin.xz
0d2536382ddda7a92c68d7e6ffde23d4 *MWE_S2S_EU_typed_100d_skip-gram.bin.xz
e4ffbd8874d2ca4593036884c4f7fb0b *MWE_S2S_FR_typed_100d_skip-gram.bin.xz
12804a1e814d9fd4f03608f821c73f08 *MWE_S2S_GA_typed_100d_skip-gram.bin.xz
c382b1f98e4d18c1652ef65d14f0a06b *MWE_S2S_HE_typed_100d_skip-gram.bin.xz
b4fb6af09acf2fee1fa4c426f9b60b53 *MWE_S2S_HI_typed_100d_skip-gram.bin.xz
07a883f11b4442f2b0611e1fa29101f4 *MWE_S2S_IT_typed_100d_skip-gram.bin.xz
e2c4349c12f2da5fe4c7a918f67c0ec9 *MWE_S2S_PL_typed_100d_skip-gram.bin.xz
a40e2effa510c076c46e0eebe8c31bef *MWE_S2S_PT_typed_100d_skip-gram.bin.xz
335a84cd858ad1854696cbb7cc41dd90 *MWE_S2S_RO_typed_100d_skip-gram.bin.xz
1c80c77c6eb64d9842fb3cae87d8dce8 *MWE_S2S_SV_typed_100d_skip-gram.bin.xz
4f193bcbc0c4d8dee5cbac6b753afe93 *MWE_S2S_TR_typed_100d_skip-gram.bin.xz
c60eee0a4faf6ed54705e6d080335881 *MWE_S2S_ZH_typed_ . . .
Preview of sha3_checksums.txt (truncated):
e12d5f4d7539b161098d922ffb8935e9e9d350aec9a0f8aea110aac5 *MWE_S2S_DE_typed_100d_skip-gram.bin.xz
59a224848baee4c956565374935ccb8c53faa1151650ceb9af14f999 *MWE_S2S_EL_typed_100d_skip-gram.bin.xz
6a4f6b423d597db6b6b06942d452eac8aeffa6ef0d6e97d7e88d6c65 *MWE_S2S_EU_typed_100d_skip-gram.bin.xz
297d6cc656d909d62b063af92aada8b222b694e50e54a3e9ac984736 *MWE_S2S_FR_typed_100d_skip-gram.bin.xz
107bff85292d176ef0ac975e9f0625fd8376cba223e7b2f33bb03e7b *MWE_S2S_GA_typed_100d_skip-gram.bin.xz
d4951bd3322a635ab0971be28b91b630109ad9fd4878d8d8700b3984 *MWE_S2S_HE_typed_100d_skip-gram.bin.xz
6be3a825011a0213f33d5bce32e58e6aae26232d4a0dcfd4c266d477 *MWE_S2S_HI_typed_100d_skip-gram.bin.xz
dd436cd35d95395417eff750690b957aee6f906590b3ea238110cf96 *MWE_S2S_IT_typed_100d_skip-gram.bin.xz
c979d05a0d5c2658fc112d7555191b8f0601903a4eed47e83c561cd6 *MWE_S2S_PL_typed_100d_skip-gram.bin.xz
a64110568c0987dd152c80d2833e592523580da4991d763129dfb4f8 *MWE_S2S_PT_typed_100d_skip-gram.bin.xz
903c1f3e3d792625a064310623d317 . . .
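The record ships two checksum manifests next to the vector files. The following is a minimal Python sketch of checking downloaded files against such a manifest. It is not the bundled verify_checksums.sh, whose contents are not shown here. The function names are hypothetical. The sketch assumes one `<digest> *<file name>` entry per line, as in the md5sum convention visible in the previews, and that the manifest sits in the same directory as the files. The use of `sha3_224` for sha3_checksums.txt is an inference from the 56-hex-character digests in its preview, not something the record states.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Parse lines of the form '<hex digest> *<file name>' into {name: digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        expected[name.lstrip("*")] = digest.lower()
    return expected


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte archives are not read at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(manifest_path, algorithm="md5"):
    """Return {file name: True/False}; a missing file counts as a failure."""
    manifest_path = Path(manifest_path)
    expected = parse_manifest(manifest_path.read_text(encoding="utf-8"))
    results = {}
    for name, digest in expected.items():
        target = manifest_path.parent / name
        results[name] = target.exists() and file_digest(target, algorithm) == digest
    return results
```

Under these assumptions, `verify("md5_checksums.txt")` would check the MD5 manifest and `verify("sha3_checksums.txt", "sha3_224")` the SHA-3 one. Any file reported as `False` is either missing or differs from the published digest and should be downloaded again.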