Primate Pose Estimation: CPM + HRNet Hybrid Architecture
This project tackled the OpenMonkeyChallenge, which aims to estimate 17 anatomical keypoints from single images of various monkey species. We developed and benchmarked a deep learning pipeline for markerless pose estimation using convolutional pose machines (CPMs), High-Resolution Networks (HRNet), and a novel hybrid CPM-HRNet model.
Problem Scope and Dataset
The task involved single-instance keypoint detection on a dataset of roughly 111,500 images from 26 non-human primate species. Ground truth annotations included 17 anatomical landmarks per monkey. The dataset was split into 66,917 training images and 22,306 images each for validation and testing. We used top-down pose estimation: each monkey instance was cropped using its bounding box before keypoint prediction was applied.
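The top-down cropping step can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing: the `(x, y, w, h)` bounding-box format, the function names, and the choice to stretch the crop to a square 224×224 input (rather than pad it) are all assumptions.

```python
import numpy as np

def crop_instance(image, keypoints, bbox):
    """Crop one monkey instance and shift its keypoints into crop coordinates.

    image: (H, W, C) array; keypoints: (K, 2) array of (x, y) pixels;
    bbox: integer (x, y, w, h) in image coordinates (assumed format).
    """
    x0, y0, w, h = bbox
    crop = image[y0:y0 + h, x0:x0 + w]
    shifted = np.asarray(keypoints, dtype=np.float64) - np.array([x0, y0])
    return crop, shifted

def rescale_keypoints(keypoints, crop_size, out_size=224):
    """Map crop-space keypoints to the network input resolution.

    crop_size: (w, h) of the crop. Each axis is scaled independently,
    which assumes the crop is stretched (not padded) to out_size x out_size.
    """
    w, h = crop_size
    return np.asarray(keypoints, dtype=np.float64) * np.array([out_size / w, out_size / h])
```

The image itself would be resized with whatever image library the pipeline uses; only the coordinate bookkeeping is shown here, since the keypoints must follow the same transform as the pixels.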
Baseline Models: CPM and HRNet
CPMs use a staged architecture in which the heatmaps produced by one stage are passed to the next stage, which refines them. Intermediate supervision at every stage improves convergence and helps handle occlusions and visual ambiguities.
HRNet, a state-of-the-art backbone, maintains high-resolution feature maps throughout, enabling accurate spatial localization of keypoints.
We trained both models from scratch, tuning hyperparameters and performing inference using multi-stage heatmap regression. Evaluation used PCK@0.05 (Probability of Correct Keypoint) as the metric.
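For reference, a PCK computation might look like the sketch below. It assumes that a keypoint counts as correct when its pixel error is below the threshold times the instance's bounding-box width; the normalisation length, the array shapes, and the optional `visible` mask are assumptions rather than details taken from this project.

```python
import numpy as np

def pck(pred, gt, bbox_widths, threshold=0.05, visible=None):
    """Fraction of keypoints whose error is below threshold * bbox width.

    pred, gt: (N, K, 2) arrays of (x, y) pixel coordinates.
    bbox_widths: (N,) normalisation length per instance (assumed: box width).
    visible: optional (N, K) boolean mask of keypoints to score.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    dist = np.linalg.norm(pred - gt, axis=-1)  # (N, K) pixel errors
    limit = threshold * np.asarray(bbox_widths, dtype=np.float64)[:, None]
    correct = dist < limit
    if visible is None:
        return float(correct.mean())
    return float(correct[np.asarray(visible, dtype=bool)].mean())
```

With a 200-pixel-wide box and a threshold of 0.05, a prediction is accepted only if it lands within 10 pixels of the annotation.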
Proposed Model: CPM-HRNet Hybrid
To address the diversity and complexity of monkey poses and species-specific anatomical variations, we proposed a CPM-HRNet hybrid architecture:
Replaced the initial CPM feature extractor with HRNet32 or HRNet48, retaining high-resolution contextual features.
Replaced later convolutional blocks in CPM with a custom “small-HRNet” module to increase model capacity without excessive compute.
Integrated intermediate supervision at each stage to guide heatmap prediction and gradient flow.
Used Gaussian-distributed ground truth heatmaps and summed loss across all stages for backpropagation.
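The heatmap targets and the summed multi-stage loss described above can be illustrated with a small NumPy sketch. The sigma of 3 matches the value used in training; the heatmap resolution, the unnormalised Gaussian with peak value 1, and the use of mean squared error as the per-stage loss are assumptions.

```python
import numpy as np

def gaussian_heatmap(center, size=224, sigma=3.0):
    """Ground-truth heatmap with a 2D Gaussian peak (value 1.0) at center=(x, y).

    size is the heatmap side length (assumed square; the real resolution
    may be lower than the input image).
    """
    cx, cy = center
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def summed_stage_loss(stage_outputs, target):
    """Intermediate supervision: add one loss term per stage.

    Each stage's prediction is compared with the same target heatmap
    (MSE assumed), and the terms are summed before backpropagation.
    """
    return sum(float(np.mean((out - target) ** 2)) for out in stage_outputs)
```

Because every stage contributes its own term, early stages receive a direct training signal instead of relying only on gradients propagated back from the final stage.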
Training & Augmentation
Training used the Adam optimizer with learning rates of 1e-4 (CPM) and 1e-3 (HRNet/hybrid) for 30 epochs, with an image size of 224×224, a batch size of 32, and sigma=3 for the ground truth heatmaps. To combat overfitting and improve generalization:
Applied four data augmentations: random flip, rotation (±15°), scale (0.75–1.25×), and HSV shift.
Observed a significant boost in validation accuracy after adding augmentation.
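Geometric augmentations have to be applied to the keypoints as well as to the pixels. The sketch below shows one way to sample the flip, rotation, and scale parameters in the stated ranges and transform keypoint coordinates accordingly. The 50% flip probability, the function names, and the `flip_pairs` argument (index pairs of left/right landmarks, which depend on the dataset's keypoint order) are hypothetical. The HSV shift changes only pixel values, so it is omitted.

```python
import numpy as np

def sample_augmentation(rng):
    """Draw one set of augmentation parameters (flip probability assumed 0.5)."""
    return {
        "flip": bool(rng.random() < 0.5),
        "angle_deg": float(rng.uniform(-15.0, 15.0)),
        "scale": float(rng.uniform(0.75, 1.25)),
    }

def flip_keypoints(keypoints, width, flip_pairs):
    """Mirror keypoints horizontally and swap left/right landmark indices."""
    kp = np.array(keypoints, dtype=np.float64)
    kp[:, 0] = (width - 1) - kp[:, 0]
    for left, right in flip_pairs:
        kp[[left, right]] = kp[[right, left]]
    return kp

def rotate_scale_keypoints(keypoints, center, angle_deg, scale):
    """Rotate keypoints about center, then scale them about the same point.

    The sign convention must match the one used to rotate the image.
    """
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    c = np.asarray(center, dtype=np.float64)
    return (np.asarray(keypoints, dtype=np.float64) - c) @ rot.T * scale + c
```

Swapping the left/right indices after a flip matters: without it, a mirrored left wrist would still be labelled as a left wrist even though it now appears on the animal's right side.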
Results
| Model | PCK@0.05 (Total) |
|---|---|
| CPM | 62.16% |
| HRNet-32 | 73.77% |
| HRNet-48 | 77.90% |
| Ours (HRNet32) | 78.44% |
| Ours (HRNet48) | 79.85% |
Tail and hip keypoints remained challenging across models due to visual ambiguity and inter-species variation.
The hybrid model outperformed all baselines across nearly all keypoints and exhibited stronger robustness to occlusion and anatomical diversity.
