Autoencoding Improves Pre-trained Word Embeddings

Kaneko, Masahiro and Bollegala, Danushka ORCID: 0000-0003-4476-7003
(2020) Autoencoding Improves Pre-trained Word Embeddings. In: Proceedings of the 28th International Conference on Computational Linguistics, December 2020, Virtual.

main.pdf - Author Accepted Manuscript


Prior work investigating the geometry of pre-trained word embeddings has shown that word embeddings are distributed in a narrow cone, and that by centering and projecting using principal component vectors one can increase the accuracy of a given set of pre-trained word embeddings. However, theoretically this post-processing step is equivalent to applying a linear autoencoder to minimise the squared ℓ2 reconstruction error. This result contradicts prior work (Mu and Viswanath, 2018) that proposed to remove the top principal components from pre-trained embeddings. We experimentally verify our theoretical claims and show that retaining the top principal components is indeed useful for improving pre-trained word embeddings, without requiring access to additional linguistic resources or labeled data.
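The equivalence sketched in the abstract can be illustrated concretely: for the squared ℓ2 reconstruction loss, the optimal linear autoencoder projects the centred embeddings onto their top-k principal components, i.e. it retains (rather than removes) the top components. The following is a minimal numpy sketch of that post-processing step; it is an illustration under this assumption, not the authors' released implementation, and the function name `autoencode_embeddings` is hypothetical.

```python
import numpy as np

def autoencode_embeddings(E, k):
    """Post-process pre-trained word embeddings via a linear autoencoder.

    The optimal linear autoencoder under squared l2 reconstruction error
    projects the centred embeddings onto their top-k principal components,
    so we compute that projection directly with an SVD.

    E: (n_words, dim) embedding matrix; k: latent dimensionality (k <= dim).
    Returns the reconstructed (post-processed) embedding matrix.
    """
    mu = E.mean(axis=0)          # centre the embedding space
    X = E - mu
    # Rows of Vt are the principal directions; Vt[:k] spans the subspace
    # learned by the optimal linear encoder/decoder.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k]                   # (k, dim) tied encoder/decoder weights
    Z = X @ W.T                  # encode into the k-dimensional latent space
    return Z @ W + mu            # decode back, adding the mean again
```

With k equal to the full embedding dimensionality the reconstruction is exact; smaller k keeps only the dominant directions of variation, which is the "retain the top principal components" operation the abstract argues for.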

Item Type: Conference or Workshop Item (Unspecified)
Depositing User: Symplectic Admin
Date Deposited: 03 Nov 2020 10:26
Last Modified: 09 Jun 2024 03:15
DOI: 10.18653/v1/2020.coling-main.149