Action Recognition System and Method

Opportunity

Human action recognition from video is critical for applications such as surveillance, gaming, and human-computer interaction. A major challenge, however, is viewpoint variation: the subject's appearance changes drastically as the camera moves or the subject rotates. Existing methods rely on fixed, human-defined pre-processing (e.g., centering and aligning the skeleton) or require massive multi-view datasets. Such pre-processing strategies are not flexible enough for real-world scenarios like drone or surveillance footage, where the camera and subject move relative to each other, and they do not explicitly learn viewpoints that are optimal for action recognition, limiting their robustness and accuracy. A more adaptive, learnable approach is needed.

Technology

This patent presents an action recognition system with a View-Adaptive Neural Network that dynamically learns and corrects for viewpoint variations. The system receives 3D skeleton data (joint positions) from an RGB-D camera or a pose-estimation algorithm. For each frame, a View Adaptation Block applies an unsupervised learning algorithm to determine optimal transformation parameters (rotation angles α, β, γ and a translation vector b). It then transforms the entire skeleton with the corresponding 3D rotation matrix, effectively re-orienting the subject to a canonical "best view" without manual pre-processing.
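
The per-frame transformation described above can be sketched as follows. This is an illustrative NumPy sketch, not the patented implementation: the function names (`rotation_matrix`, `adapt_view`) and the example joint coordinates are assumptions, and in the actual system α, β, γ and b are produced by the learned View Adaptation Block rather than supplied by hand.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Compose rotations about the x, y, and z axes (angles in radians)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def adapt_view(skeleton, alpha, beta, gamma, b):
    """Re-orient one frame of 3D joints: v' = R (v - b).

    skeleton: (J, 3) array of joint positions for one frame.
    b: (3,) translation vector; alpha/beta/gamma: learned rotation angles.
    """
    R = rotation_matrix(alpha, beta, gamma)
    return (skeleton - b) @ R.T

# Example: a toy 4-joint frame observed from an arbitrary viewpoint.
frame = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.5, 0.0],
                  [0.2, 0.5, 0.1],
                  [-0.2, 0.5, 0.1]])
canonical = adapt_view(frame, alpha=0.1, beta=-0.3, gamma=0.05,
                       b=frame.mean(axis=0))
```

Because the transform is rigid (rotation plus translation), bone lengths and relative joint geometry are preserved; only the viewing direction changes.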

The transformed skeleton sequence is then fed into a Graph Neural Network (GNN). The GNN converts the skeleton into a graph (joints=nodes, bones=edges). Using adaptive graph convolutions, it learns both the physical connectivity and latent relationships between joints. Multiple residual blocks process spatio-temporal features, and a classifier outputs the recognized action. The entire network (view adaptation + GNN) is trained end-to-end, allowing the view parameters to be optimized specifically for action recognition accuracy.
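
The graph construction and adaptive graph convolution can be illustrated with a minimal sketch. Assumptions here: NumPy only, a toy 5-joint skeleton, and a single layer with symmetric normalization and ReLU; `normalized_adjacency`, `graph_conv`, and the optional learned-adjacency term are hypothetical names standing in for the patent's adaptive graph convolutions.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A = np.eye(num_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def graph_conv(X, A_hat, W, A_learned=None):
    """One adaptive graph-convolution layer.

    X: (J, C_in) joint features; W: (C_in, C_out) weights.
    A_learned: optional (J, J) data-driven adjacency capturing latent
    joint relationships, added to the fixed skeletal adjacency.
    """
    A_total = A_hat if A_learned is None else A_hat + A_learned
    return np.maximum(A_total @ X @ W, 0.0)  # ReLU activation

# Toy 5-joint skeleton: spine-base -> spine -> head, spine -> two arms.
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]
A_hat = normalized_adjacency(edges, num_joints=5)
X = np.random.default_rng(0).normal(size=(5, 3))  # 3D coordinates as input features
W = np.random.default_rng(1).normal(size=(3, 8))
H = graph_conv(X, A_hat, W)
```

Stacking such layers (with residual connections, as in the patent's residual blocks) lets features propagate along both physical bones and learned latent links before classification.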

Advantages

  • View-Invariant Recognition: Dynamically adapts to any camera angle without manual pre-processing, significantly improving accuracy under viewpoint changes.
  • Unsupervised View Learning: The view adaptation block learns optimal transformations using only classification loss, without requiring ground-truth viewpoint labels.
  • End-to-End Training: The view adaptation and GNN are jointly optimized, leading to better feature learning than separate pre-processing pipelines.
  • Outperforms State-of-the-Art: Achieves higher accuracy on NTU60 (94.18% CV, 86.21% CS) than methods using fixed pre-processing (A-GCN-P: 92.70% CV, 84.30% CS).
  • Parameter Efficient: The view adaptation block adds minimal parameters (<0.3M) compared to stacking more GCN layers (+2.33M for 4 layers), yet yields better performance gains.
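
To make the end-to-end coupling concrete, below is a deliberately tiny numerical sketch in which a single view angle is updated using only the classification loss, with a finite-difference gradient standing in for backpropagation. The toy linear classifier, single z-axis rotation, and all variable names are assumptions for illustration; the patented system jointly trains the full View Adaptation Block and GNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate_z(skel, gamma):
    """Toy view adaptation: rotate all joints about the z-axis by gamma."""
    R = np.array([[np.cos(gamma), -np.sin(gamma), 0.0],
                  [np.sin(gamma),  np.cos(gamma), 0.0],
                  [0.0, 0.0, 1.0]])
    return skel @ R.T

def loss(gamma, W, skel, label):
    """Cross-entropy of a toy linear classifier on the rotated skeleton."""
    logits = rotate_z(skel, gamma).ravel() @ W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[label])

# Toy data: one 3-joint frame, 2 action classes, random classifier weights.
skel = rng.normal(size=(3, 3))
W = rng.normal(size=(9, 2))
label = 0

# The view angle receives gradient from the classification loss alone --
# no ground-truth viewpoint label is ever used.
gamma, lr, eps = 0.0, 0.05, 1e-5
loss_before = loss(gamma, W, skel, label)
for _ in range(200):
    grad = (loss(gamma + eps, W, skel, label)
            - loss(gamma - eps, W, skel, label)) / (2 * eps)
    gamma -= lr * grad
loss_after = loss(gamma, W, skel, label)
```

The same principle, applied by backpropagation through the full network, is what allows the view parameters to be optimized specifically for recognition accuracy.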

Applications

  • Surveillance & Security: Recognizing suspicious actions or gestures from fixed or moving cameras (e.g., CCTV, drones) regardless of viewpoint.
  • Human-Computer Interaction: Enabling gesture control for smart TVs, gaming consoles, or VR/AR headsets without requiring user-facing orientation.
  • Autonomous Vehicles: Recognizing pedestrian gestures or driver actions from vehicle-mounted cameras at varying angles.
  • Healthcare & Rehabilitation: Monitoring patient exercises or daily activities from bedside cameras without restricting camera placement.
  • Sports Analytics: Analyzing athlete movements from broadcast footage captured from multiple, changing camera angles.
Remarks

  • CIMDA: P00037
  • IP Status: Patent filed
  • Technology Readiness Level (TRL): 4
Questions about this Technology?
Contact Our Tech Manager