Dynamic Contrastive Distillation for Image-Text Retrieval

Rao, Jun, Ding, Liang, Qi, Shuhan, Fang, Meng ORCID: 0000-0001-6745-286X, Liu, Yang, Shen, Li and Tao, Dacheng
(2023) Dynamic Contrastive Distillation for Image-Text Retrieval. IEEE Transactions on Multimedia, 25. pp. 1-13.

TMM_camera.pdf - Author Accepted Manuscript


Recent advances in vision-and-language pretraining (VLP) have significantly improved the performance of cross-modal image-text retrieval (ITR) systems. However, the increasing size of VLP models makes real-world deployment difficult: their high inference latency is unsuitable for practical search scenarios. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face two challenges: 1) the typical uni-modal metric-learning approach is difficult to apply directly to cross-modal tasks, because GPU memory is too limited to optimize over many negative samples when handling cross-modal fusion features; 2) statically optimizing the student network on hard samples of differing difficulty is inefficient, which hinders distillation learning and student optimization. We propose a multi-modal contrastive learning method that balances training cost and effectiveness: a teacher network identifies hard samples for the student networks to learn from, allowing the students to leverage the knowledge of pre-trained teachers and to learn effectively from hard samples. To learn from hard sample pairs, we further propose dynamic distillation, which dynamically weights samples of different difficulties to better balance the difficulty of the transferred knowledge against the student's own learning ability. We successfully apply the proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. We further provide in-depth analyses and discussions that explain how the performance improves.
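The abstract's two key ideas (teacher-selected hard negatives and difficulty-dependent weighting) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `dcd_loss`, the top-k hard-negative selection, and the `1 - p_pos` difficulty weight are illustrative assumptions standing in for the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dcd_loss(teacher_sim, student_sim, k=2):
    """Hypothetical sketch of dynamic contrastive distillation.

    teacher_sim, student_sim: (B, B) image-text similarity matrices,
    where row i scores image i against all B texts and the diagonal
    entries are the matched (positive) pairs. For each image, the
    teacher picks the k hardest negative texts (highest teacher
    similarity among mismatches); the student is distilled only on
    {positive + k hard negatives}, and each row's loss is weighted by
    a teacher-estimated difficulty (1 - positive probability), so
    harder samples contribute more (an assumed weighting scheme).
    """
    B = teacher_sim.shape[0]
    losses, weights = [], []
    for i in range(B):
        negatives = [j for j in range(B) if j != i]
        # hard negatives: mismatched texts the teacher scores highest
        hard = sorted(negatives, key=lambda j: teacher_sim[i, j],
                      reverse=True)[:k]
        cols = [i] + hard  # positive first, then hard negatives
        t = softmax(teacher_sim[i, cols])
        s = softmax(student_sim[i, cols])
        # KL(teacher || student) over the restricted candidate set
        kl = float(np.sum(t * (np.log(t) - np.log(s))))
        losses.append(kl)
        weights.append(1.0 - t[0])  # low positive prob => harder row
    weights = np.array(weights)
    return float(np.sum(weights * np.array(losses)) / (weights.sum() + 1e-8))
```

Restricting the softmax to the positive plus a few teacher-chosen hard negatives is one way to keep the memory cost of cross-modal contrastive distillation bounded, since the student never needs logits over the full batch of fusion features at once.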

Item Type: Article
Divisions: Faculty of Science and Engineering > School of Electrical Engineering, Electronics and Computer Science
Depositing User: Symplectic Admin
Date Deposited: 24 May 2023 08:44
Last Modified: 15 Mar 2024 19:28
DOI: 10.1109/tmm.2023.3236837
Related URLs:
URI: https://livrepository.liverpool.ac.uk/id/eprint/3170622