Abstract: Hand key point detection is crucial for natural human-computer interaction. The task is highly challenging, however, owing to complex articulations, diverse viewpoints, self-similar parts, significant self-occlusion, and variation in hand shape and size. To address these challenges, the thesis makes several contributions. First, it introduces a novel approach that uses a multi-camera system to train precise detectors for key points that are prone to occlusion, such as the hand joints. This method, termed multiview bootstrapping, begins with an initial key point detector that produces noisy labels across multiple views of the hand. The noisy detections are then either triangulated in 3D using multiview geometry or rejected as outliers. The triangulated points, once reprojected into each view, serve as new labeled training data to retrain the detector. This process repeats, yielding additional labeled data with each iteration. The thesis also presents an analytical derivation of the minimum number of views required to achieve target true- and false-positive rates for a given detector. The method is then used to train a hand key point detector for single images. The resulting detector runs in real time on RGB images and matches the accuracy of methods that rely on depth sensors. Finally, applying the single-view detector with triangulation across multiple views enables markerless 3D hand motion capture, even during complex object interactions.
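The core of the multiview-bootstrapping step described above (triangulate noisy 2D detections, reject outlier views by reprojection error, keep reprojections as new labels) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names and the 5-pixel inlier threshold are illustrative assumptions, and the standard direct linear transform (DLT) is used for triangulation.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Triangulate one 3D point from per-view 2D detections via DLT.

    proj_mats: list of 3x4 camera projection matrices.
    points_2d: list of (x, y) detections, one per view.
    """
    A = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous 3D point.
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                     # null-space vector = homogeneous solution
    return X[:3] / X[3]

def reproject(P, X):
    """Project a 3D point X into a view with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def bootstrap_labels(proj_mats, detections, inlier_thresh=5.0):
    """One bootstrapping step for a single key point.

    Triangulate the detections, then keep the reprojection as a new
    pseudo-label in every view whose reprojection error is small;
    views that disagree are treated as outliers (no label).
    """
    X = triangulate_dlt(proj_mats, detections)
    new_labels = []
    for P, det in zip(proj_mats, detections):
        err = np.linalg.norm(reproject(P, X) - np.asarray(det))
        new_labels.append(reproject(P, X) if err < inlier_thresh else None)
    return X, new_labels
```

In the full method, the surviving reprojections across many frames become additional training data, the detector is retrained, and the loop repeats; a robust (e.g. RANSAC-style) triangulation would replace the plain DLT when many views are corrupted.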

Keywords: Convolutional Neural Network, Key point detector, Density Network with a Single Gaussian Model, Mixture Density Network, Degree of Freedom.


DOI: 10.17148/IJARCCE.2024.13477
