Solving Cosine Similarity Underestimation between High Frequency Words by ℓ2 Norm Discounting

Wannasuphoprasit, Saeth, Zhou, Yi and Bollegala, Danushka
(2023) Solving Cosine Similarity Underestimation between High Frequency Words by ℓ2 Norm Discounting. In: Findings of the Association for Computational Linguistics: ACL 2023, July 2023, Toronto, Canada.

Curious_case_of_High_Freq_Words.pdf - Author Accepted Manuscript


Cosine similarity between two words, computed using their contextualised token embeddings obtained from masked language models (MLMs) such as BERT, has been shown to underestimate the actual similarity between those words (Zhou et al., 2022). This similarity underestimation problem is particularly severe for highly frequent words. Although the problem has been noted in prior work, no solution has been proposed thus far. We observe that the ℓ2 norm of the contextualised embeddings of a word correlates with its log-frequency in the pretraining corpus. Consequently, the larger ℓ2 norms associated with highly frequent words reduce the cosine similarity values measured between them, thus underestimating the similarity scores. To solve this issue, we propose a method that discounts the ℓ2 norm of a contextualised word embedding by the frequency of that word in a corpus when measuring the cosine similarity between words. We show that the so-called stop words behave differently from the rest of the words and therefore require special consideration during discounting. Experimental results on a contextualised word similarity dataset show that our proposed discounting method accurately solves the similarity underestimation problem.
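
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact discounting formula, so the function `discount_factor`, its parameter `alpha`, and the logarithmic form used below are assumptions made for illustration only and should not be read as the authors' method. The sketch only shows the general mechanism: shrinking the ℓ2 norm of a frequent word's embedding raises the resulting similarity score.

```python
import math


def l2_norm(vec):
    """Euclidean (l2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def discount_factor(freq, alpha=0.1):
    """Hypothetical discount in (0, 1]; smaller for more frequent words.

    ASSUMPTION: this logarithmic form and the parameter `alpha` are
    illustrative and are not taken from the paper.
    """
    return 1.0 / (1.0 + alpha * math.log1p(freq))


def discounted_cosine(x, y, freq_x, freq_y, alpha=0.1):
    """Cosine similarity with each l2 norm scaled by a frequency discount."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = discount_factor(freq_x, alpha) * l2_norm(x)
    norm_y = discount_factor(freq_y, alpha) * l2_norm(y)
    return dot / (norm_x * norm_y)


# Toy embeddings; real contextualised embeddings would come from an MLM.
x = [3.0, 4.0]
y = [4.0, 3.0]

# With frequency 0 the discount is 1, so this is the standard cosine.
plain = discounted_cosine(x, y, freq_x=0, freq_y=0)

# Treating both words as highly frequent shrinks the norms and raises the score.
frequent = discounted_cosine(x, y, freq_x=1_000_000, freq_y=1_000_000)
```

Under this assumed form the discounted score is no longer bounded by 1, so `alpha` would have to be tuned on held-out data. The sketch also applies one discount to every word, whereas the abstract notes that stop words need separate treatment.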

Item Type: Conference or Workshop Item (Unspecified)
Uncontrolled Keywords: 4901 Applied Mathematics, 49 Mathematical Sciences, 52 Psychology
Divisions: Faculty of Science and Engineering > School of Electrical Engineering, Electronics and Computer Science
Depositing User: Symplectic Admin
Date Deposited: 25 May 2023 07:29
Last Modified: 20 Jun 2024 19:40
DOI: 10.18653/v1/2023.findings-acl.550