The Earth Mover’s Distance as a Metric for the Space of Inorganic Compositions



Hargreaves, Cameron J, Dyer, Matthew S ORCID: 0000-0002-4923-3003, Gaultois, Michael W ORCID: 0000-0003-2172-2507, Kurlin, Vitaliy A ORCID: 0000-0001-5328-5351 and Rosseinsky, Matthew J ORCID: 0000-0002-1910-2483
(2020) The Earth Mover’s Distance as a Metric for the Space of Inorganic Compositions. Chemistry of Materials, 32 (24). pp. 10610-10620.

EMD_Revised_Clean_3rd_rev.docx - Author Accepted Manuscript

EMD_si_revised.docx - Author Accepted Manuscript


Abstract

Reliably telling how close two objects are to being the same is a core problem in any field; once such a relation has been established, it can be used to quantify potential relationships precisely, both analytically and with machine learning (ML). For inorganic solids, the chemical composition is a fundamental descriptor, which can be represented as a vector whose entries are the ratios of each element in the material. These vectors are a convenient mathematical data structure for measuring similarity, but unfortunately the standard metric (the Euclidean distance) gives little to no variance in the resultant distances between chemically dissimilar compositions. We present the earth mover's distance (EMD) for inorganic compositions, a well-defined metric that allows chemical similarity to be measured in an explainable fashion. We compute the EMD between two compositions from the ratio of each of the elements and the absolute distance between the elements on the modified Pettifor scale. This simple metric shows clear strength at distinguishing compounds and is efficient to compute in practice. The resultant distances align more closely with chemical understanding than those from the Euclidean distance, as demonstrated on the binary compositions of the Inorganic Crystal Structure Database. The EMD is a reliable numeric measure of chemical similarity that can be incorporated into automated workflows for a range of ML techniques. We have found that, with no supervision, this metric gives a distinct partitioning of binary compounds into clear trends and families of chemical property, with future applications in nearest neighbor search queries for chemical database retrieval systems and in supervised ML techniques.
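
As a rough illustration of the computation described in the abstract, the following Python sketch treats each composition as a distribution of element ratios placed along a one-dimensional element scale and measures the work needed to turn one distribution into the other. The element positions in `TOY_SCALE` are hypothetical placeholders chosen for this example and are not the real modified Pettifor numbers; the function names `normalise`, `emd`, and `euclidean` are likewise assumptions of this sketch, and the authors' own implementation may differ (for instance, by using a general transport solver).

```python
from math import sqrt

# Hypothetical element positions for illustration only.
# These are NOT the real modified Pettifor numbers.
TOY_SCALE = {"Na": 1, "K": 2, "Mg": 5, "O": 9, "Cl": 10}


def normalise(comp):
    """Convert element amounts into ratios that sum to 1."""
    total = sum(comp.values())
    return {el: amount / total for el, amount in comp.items()}


def emd(comp_a, comp_b, scale=TOY_SCALE):
    """1D earth mover's distance between two compositions on `scale`."""
    a, b = normalise(comp_a), normalise(comp_b)
    # Net surplus of "earth" at each scale position (a minus b).
    surplus = {}
    for el, frac in a.items():
        surplus[scale[el]] = surplus.get(scale[el], 0.0) + frac
    for el, frac in b.items():
        surplus[scale[el]] = surplus.get(scale[el], 0.0) - frac
    positions = sorted(surplus)
    carried, work = 0.0, 0.0
    # Sweep left to right: whatever surplus has accumulated so far must be
    # carried across the gap to the next occupied position.
    for left, right in zip(positions, positions[1:]):
        carried += surplus[left]
        work += abs(carried) * (right - left)
    return work


def euclidean(comp_a, comp_b):
    """Euclidean distance between the element-ratio vectors."""
    a, b = normalise(comp_a), normalise(comp_b)
    elements = set(a) | set(b)
    return sqrt(sum((a.get(el, 0.0) - b.get(el, 0.0)) ** 2 for el in elements))


NaCl = {"Na": 1, "Cl": 1}
KCl = {"K": 1, "Cl": 1}
MgO = {"Mg": 1, "O": 1}

for name, other in (("KCl", KCl), ("MgO", MgO)):
    print(f"NaCl vs {name}: EMD={emd(NaCl, other):.2f}, "
          f"Euclidean={euclidean(NaCl, other):.2f}")
```

On this toy scale, any two 1:1 binaries that share no elements are exactly the same Euclidean distance apart, whereas the EMD varies with how far apart the elements sit on the scale, which is the contrast the abstract draws.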

Item Type: Article
Depositing User: Symplectic Admin
Date Deposited: 12 Jan 2021 09:15
Last Modified: 18 Jan 2023 23:03
DOI: 10.1021/acs.chemmater.0c03381
URI: https://livrepository.liverpool.ac.uk/id/eprint/3113257