Matching and Segmentation for Multimedia Data



Li, Hui
(2022) Matching and Segmentation for Multimedia Data. PhD thesis, University of Liverpool.

[img] Text
201381377_Hui Li_2022.pdf - Author Accepted Manuscript

Download (4MB) | Preview

Abstract

With the development of society, both industry and academia draw increasing attention to multimedia systems, which handle image/video data, audio data, and text data comprehensively and simultaneously. In this thesis, we mainly focus on multi-modality data understanding, combining the two subjects of Computer Vision (CV) and Natural Language Processing (NLP). Such a task is widely used in many real-world scenarios, including criminal search with language descriptions by the witness, robotic navigation with language instruction in the smart industry, terrorist tracking, missing person identification, and so on. However, such a multi-modality system still faces many challenges, limiting its performance and ability in real-life situations, including the domain gap between the modalities of vision and language, the request for high-quality datasets, and so on. Therefore, to better analyze and handle these challenges, this thesis focuses on the two fundamental tasks, including matching and segmentation. Image-Text Matching (ITM) aims to retrieve the texts (images) that describe the most relevant contents for a given image (text) query. Due to the semantic gap between the linguistic and visual domains, aligning and comparing feature representations for languages and images are still challenging. To overcome this limitation, we propose a new framework for the image-text matching task, which uses an auxiliary captioning step to enhance the image feature, where the image feature is fused with the text feature of the captioning output. As the downstream application of ITM, the language-person search is one of the specific cases where language descriptions are provided to retrieve person images, which also suffers the domain gap between linguistic and visual data. To handle this problem, we propose a transformer-based language-person search matching framework with matching conducted between words and image regions for better image-text interaction. However, collecting a large amount of training data is neither cheap nor reliable using human annotations. We further study the one-shot person Re-ID (re-identification) task aiming to match people by offering one labeled reference image for each person, where previous methods request a large number of ground-truth labels. We propose progressive sample mining and representation learning to fit the limited labels for the one-shot Re-ID task better. Referring Expression Segmentation (RES) aims to localize and segment the target according to the given language expression. Existing methods jointly consider the localization and segmentation steps, which rely on the fused visual and linguistic features for both steps. We argue that the conflict between the purpose of finding the object and generating the mask limits the RES performance. To solve this problem, we propose a parallel position-kernel-segmentation pipeline to better isolate then interact with the localization and segmentation steps. In our pipeline, linguistic information will not directly contaminate the visual feature for segmentation. Specifically, the localization step localizes the target object in the image based on the referring expression, then the visual kernel obtained from the localization step guides the segmentation step. This pipeline also enables us to train RES in a weakly-supervised way, where the pixel-level segmentation labels are replaced by click annotations on center and corner points. The position head is fully-supervised trained with the click annotations as supervision, and the segmentation head is trained with weakly-supervised segmentation losses. This thesis focus on the key limitations of the multimedia system, where the experiments prove that the proposed frameworks are effective for specific tasks. The experiments are easy to reproduce with clear details, and source codes are provided for future works aiming at these tasks.

Item Type: Thesis (PhD)
Uncontrolled Keywords: Multimedia Data, Image-Text Matching, Language-Person Search, Re-Identification, Referring Expression Segmentation
Divisions: Faculty of Science and Engineering > School of Electrical Engineering, Electronics and Computer Science
Depositing User: Symplectic Admin
Date Deposited: 10 Nov 2022 15:54
Last Modified: 18 Jan 2023 19:49
DOI: 10.17638/03165729
Supervisors:
URI: https://livrepository.liverpool.ac.uk/id/eprint/3165729