Abstract: The computational interpretation of human kinematics and gestural language represents a sophisticated frontier in machine intelligence, with profound implications for assistive communication, healthcare diagnostics, and touchless human-computer interaction (HCI). Traditional movement-analysis methodologies frequently encounter a "performance-efficiency" bottleneck: high-fidelity recognition demands computational overhead that renders real-time deployment on standard consumer hardware impractical. Conventional pixel-based processing is further compromised by environmental noise, varying illumination, and complex background occlusions.
This research introduces an Integrated Multi-Modal Perception Framework that unifies skeletal tracking, behavioral classification, and sign language interpretation into a single high-performance system. It avoids the pixel-level fragility and computational cost of traditional Convolutional Neural Networks (CNNs) by adopting a landmark-centric approach: using the MediaPipe perception pipeline, the framework extracts 33 body landmarks and 21 per-hand keypoints in 3D coordinate space. Reducing raw video to this low-dimensional kinematic representation permits fluid execution without dedicated GPU acceleration.
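As a concrete illustration, the sketch below shows one way such a landmark-centric feature vector could be assembled with the MediaPipe Solutions API in Python. The module wiring, confidence values, and the 258-dimensional layout (33 × 4 pose values plus 2 × 21 × 3 hand values) are assumptions made for this example, not the paper's reported implementation.

```python
import cv2
import mediapipe as mp
import numpy as np

# Assumed setup: separate Pose and Hands solutions sharing one preprocessed
# RGB frame; the confidence values and vector layout are illustrative.
pose = mp.solutions.pose.Pose(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5)
hands = mp.solutions.hands.Hands(max_num_hands=2,
                                 min_detection_confidence=0.5,
                                 min_tracking_confidence=0.5)

def extract_keypoints(frame_bgr):
    """Flatten one video frame into a fixed-length kinematic feature vector."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # shared preprocessing step
    pose_res = pose.process(rgb)
    hands_res = hands.process(rgb)

    # 33 body landmarks x (x, y, z, visibility) = 132 values
    body = (np.array([[p.x, p.y, p.z, p.visibility]
                      for p in pose_res.pose_landmarks.landmark]).flatten()
            if pose_res.pose_landmarks else np.zeros(33 * 4))

    # Up to 2 hands x 21 keypoints x (x, y, z) = 126 values, zero-padded
    hand_vec = np.zeros((2, 21 * 3))
    if hands_res.multi_hand_landmarks:
        for i, hand in enumerate(hands_res.multi_hand_landmarks[:2]):
            hand_vec[i] = np.array([[p.x, p.y, p.z]
                                    for p in hand.landmark]).flatten()

    return np.concatenate([body, hand_vec.flatten()])  # 258-dim vector
```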
To resolve the challenge of interpreting dynamic motion, the system implements a Long Short-Term Memory (LSTM) recurrent neural network. This architecture models spatiotemporal dependencies across sequential frames, enabling the system to distinguish actions that share similar poses but unfold in a different temporal order, such as standing up versus sitting down. A defining innovation of this project is its decoupled modular architecture, which allows the pose and hand modules to execute independently while sharing a single, optimized preprocessing stream.
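A minimal Keras sketch of such an LSTM sequence classifier is given below; the layer widths, 30-frame window, and 10-class vocabulary are illustrative assumptions rather than the paper's reported hyperparameters.

```python
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Sequential

# Assumed hyperparameters: a 30-frame window over the 258-dim keypoint
# vectors from the extraction step, and a hypothetical 10-gesture vocabulary.
SEQ_LEN, FEAT_DIM, NUM_CLASSES = 30, 258, 10

model = Sequential([
    Input(shape=(SEQ_LEN, FEAT_DIM)),          # one sequence of keypoint frames
    LSTM(64, return_sequences=True),           # per-frame temporal features
    LSTM(128),                                 # final state summarizes the motion
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),  # per-gesture probabilities
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
```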
The integration of confidence-based thresholding and temporal smoothing further ensures the stability of predictions during live interaction. Empirical testing confirms that the proposed system delivers a robust, low-latency solution capable of operating at real-time frame rates on standard CPU architectures. By democratizing access to advanced gesture recognition, this work contributes to the development of inclusive technology that bridges the gap between physical human movement and digital understanding.
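A minimal sketch of how confidence-based thresholding and temporal smoothing could be combined at inference time is shown below; the 0.8 threshold and 10-frame window are assumed values, not figures reported by the paper.

```python
from collections import deque

import numpy as np

# Hypothetical post-processing stage: THRESHOLD and the window length are
# illustrative assumptions; the paper describes the mechanism, not these values.
THRESHOLD = 0.8            # minimum smoothed confidence to emit a prediction
window = deque(maxlen=10)  # rolling buffer of per-frame class probabilities

def stabilized_prediction(frame_probs):
    """Smooth recent softmax outputs, then gate on confidence."""
    window.append(np.asarray(frame_probs))
    smoothed = np.mean(window, axis=0)   # temporal smoothing over the window
    best = int(np.argmax(smoothed))
    if smoothed[best] >= THRESHOLD:      # confidence-based thresholding
        return best                      # stable, high-confidence class index
    return None                          # suppress flicker between classes
```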
DOI: 10.17148/IJARCCE.2026.15163
[1] Neha Priya, Rajeshwari N, "Real-Time Multi-Modal Recognition System Using Full Body Pose Estimation," International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2026.15163