Investigation of Human Pose Estimation from 2D Images to 3D Point Clouds by Deep Learning Algorithm

Jiang, Chenru (2023) Investigation of Human Pose Estimation from 2D Images to 3D Point Clouds by Deep Learning Algorithm. PhD thesis, University of Liverpool.

201026336_Aug2023.pdf - Author Accepted Manuscript
Access to this file is embargoed until 1 January 2027.

Abstract

Human pose estimation is a crucial task in computer vision that has found widespread use in various high-level applications. While 2D image-based estimation methods have been extensively studied, recent attention has turned towards 3D point cloud-based strategies. This thesis not only focuses on conventional 2D image-based human pose estimation, but also explores the emerging area of 3D point cloud estimation. By investigating both 2D and 3D methods, this thesis aims to provide a comprehensive understanding of human pose estimation techniques and their potential applications.
Deep neural networks with multi-scale feature fusion have achieved remarkable success in 2D image-based human pose estimation. Despite this, mainstream methods in this area still suffer from four major drawbacks. 1) They treat multi-scale features equally, which may over-emphasize redundant features; 2) Preferring deeper structures, they can learn features with strong semantic representation, but tend to lose natural discriminative information; 3) To attain good performance, they rely heavily on pre-training, which is time-consuming or even unavailable in practice; 4) Most existing approaches adopt complicated networks with a large number of parameters, leading to heavy models with poor cost-effectiveness in practice. To mitigate the first three problems, we propose a novel comprehensive recalibration model called Pyramid GAting Network (PGA-Net) that is capable of distilling, selecting, and fusing discriminative and attention-aware features at different scales and different levels (i.e., both semantic and natural levels). Because it fuses features both selectively and comprehensively, PGA-Net demonstrates remarkable stability and encouraging performance even without pre-training, so that the model can be trained truly from scratch. In addition, we re-design the feature transformation components of PGA-Net to aggregate more discriminative representations while reducing computational cost. The improved method is termed PGA-Net 2.0. For the fourth problem, we argue that a new backbone is needed to attain a good balance between effectiveness and efficiency. Therefore, we explore a small yet discriminative model called STair Network, which can be simply stacked into an accurate multi-stage pose estimation system. Specifically, to reduce computational cost, STair Network is composed of novel basic feature extraction blocks that focus on promoting feature diversity and obtaining rich local representations with fewer parameters, enabling a satisfactory balance between efficiency and performance.
In contrast to 2D image-based methods, 3D point cloud-based human pose estimation faces the challenge of dealing with unstructured, irregular spatial coordinates as input data. As a result, the initial information is limited, and some approaches have resorted to converting coordinates into voxels or 2D images to establish fixed inner relationships. Meanwhile, processing 3D point cloud data directly has become dominant in several fundamental tasks. Despite their success, these direct methods suffer from the following drawback: although neighborhood construction is critical for building inner relationships for feature updating in each set abstraction block, it is surprisingly under-explored in 3D point cloud analysis. To address this problem, we revisit neighborhood construction based on two questions about the k-Nearest Neighbor (kNN) method: 1) How to define the “Nearest”? and 2) Where to find the “Neighbor”? To answer the first question, we define the “Nearest” in both geometric and semantic space through our proposed union selection, which comprehensively aggregates both local and global dependencies. For the second question, we propose a snapshot selection to alleviate information loss during downsampling and overcome neighborhood mismatch during upsampling. Built upon our neighborhood construction method, we design a difference-wised attention to specifically focus on discriminative point feature extraction.
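As a rough illustration of what defining the “Nearest” in both geometric and semantic space could look like, the sketch below merges the k nearest neighbors of a point found in coordinate space with those found in feature space. It is not taken from the thesis: the function names `knn_indices` and `union_neighbors`, the Euclidean distance, the plain set-union merge, and the toy data are all assumptions made for illustration, and the actual union selection may differ.

```python
import math


def knn_indices(vectors, query_idx, k):
    """Indices of the k vectors closest (Euclidean) to vectors[query_idx], excluding itself."""
    query = vectors[query_idx]
    candidates = [i for i in range(len(vectors)) if i != query_idx]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]


def union_neighbors(coords, feats, query_idx, k):
    """Hypothetical union selection: merge coordinate-space and feature-space kNN.

    Assumption: a simple order-preserving union; the thesis may combine them differently.
    """
    geometric = knn_indices(coords, query_idx, k)  # local, spatially close points
    semantic = knn_indices(feats, query_idx, k)    # points with similar features, possibly far away
    return list(dict.fromkeys(geometric + semantic))  # drop duplicates, keep first-seen order


# Toy data (made up): four points with 3D coordinates and 2D features.
coords = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (5.0, 5.0, 5.0)]
feats = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.9), (0.9, 0.1)]

print(union_neighbors(coords, feats, query_idx=0, k=1))  # [1, 3]
```

In this toy case, point 1 is the spatial neighbor of point 0, while the distant point 3 is included because its features are similar, which is one way a neighborhood can carry both local and global dependencies.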

Item Type:	Thesis (PhD)
Uncontrolled Keywords:	Deep Learning, Machine Learning, Computer Vision, Human Pose Estimation, 3D Point Cloud
Divisions:	Faculty of Science and Engineering > School of Electrical Engineering, Electronics and Computer Science
Depositing User:	Symplectic Admin
Date Deposited:	31 Jan 2024 08:48
Last Modified:	31 Jan 2024 08:48
DOI:	10.17638/03172611
Supervisors:	Xiao, Jimin; Zhang, Rui; Goulermas, John; Huang, Kaizhu
URI:	https://livrepository.liverpool.ac.uk/id/eprint/3172611