Primate Pose Estimation: CPM + HRNet Hybrid Architecture

This project tackled the OpenMonkeyChallenge, which aims to estimate 17 anatomical keypoints from single images of various monkey species. We developed and benchmarked a deep learning pipeline for markerless pose estimation using convolutional pose machines (CPMs), High-Resolution Networks (HRNet), and a novel hybrid CPM-HRNet model.

Problem Scope and Dataset

The task involved single-instance keypoint detection using a dataset of 112,000+ images from 26 non-human primate species. Ground truth annotations included 17 anatomical landmarks per monkey. The dataset was split into 66,917 training images and 22,306 validation/test images. We used top-down pose estimation: cropping monkey instances using bounding boxes before applying keypoint prediction.
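
To make the top-down step concrete, here is a minimal sketch of the coordinate bookkeeping it implies: keypoints annotated in full-image coordinates are mapped into a resized bounding-box crop, and predictions are mapped back afterwards. The function names, the (x, y, width, height) box format, and the plain stretch-to-square resize are assumptions for illustration, not the project's actual preprocessing code.

```python
def to_crop_coords(keypoints, bbox, out_size=224):
    """Map (x, y) keypoints from image coordinates into a resized bbox crop.

    bbox is assumed to be (x0, y0, width, height) in pixels.
    """
    x0, y0, w, h = bbox
    sx, sy = out_size / w, out_size / h
    return [((x - x0) * sx, (y - y0) * sy) for x, y in keypoints]


def to_image_coords(keypoints, bbox, out_size=224):
    """Inverse of to_crop_coords: map crop-space predictions back to the image."""
    x0, y0, w, h = bbox
    sx, sy = w / out_size, h / out_size
    return [(x * sx + x0, y * sy + y0) for x, y in keypoints]


# Hypothetical box and two keypoints, only to exercise the functions.
bbox = (100, 50, 448, 112)
image_kps = [(100.0, 50.0), (324.0, 106.0)]
crop_kps = to_crop_coords(image_kps, bbox)
restored = to_image_coords(crop_kps, bbox)
```

Stretching the box to a square distorts the aspect ratio; padding the box to a square before resizing is a common alternative when that matters.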

Baseline Models: CPM and HRNet

CPMs use a staged architecture in which each stage takes the heatmaps produced by the previous stage and refines them. Intermediate supervision at every stage improves convergence and helps the model handle occlusions and visual ambiguities.
HRNet, a state-of-the-art backbone, maintains high-resolution feature maps throughout the network, enabling accurate spatial localization of keypoints.

We trained both models from scratch, tuning hyperparameters and performing inference using multi-stage heatmap regression. We evaluated with PCK@0.05 (Probability of Correct Keypoint): the fraction of predicted keypoints that fall within a normalized distance of 0.05 from the ground truth.
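
As a rough illustration of the metric, the sketch below counts a keypoint as correct when its Euclidean error is below 0.05 times a normalizing length. Using the bounding-box width as that length is an assumption here (the write-up does not state the normalizer), and the function name `pck` is hypothetical.

```python
import math


def pck(pred, gt, norm_len, eps=0.05):
    """Fraction of keypoints whose error is below eps * norm_len.

    norm_len is assumed to be the bounding-box width of the instance.
    """
    correct = 0
    for (px, py), (gx, gy) in zip(pred, gt):
        if math.hypot(px - gx, py - gy) < eps * norm_len:
            correct += 1
    return correct / len(gt)


# Made-up coordinates: with norm_len=200 the threshold is 10 pixels.
gt = [(10, 10), (50, 50), (100, 100), (150, 150)]
pred = [(12, 10), (50, 61), (100, 100), (170, 150)]
score = pck(pred, gt, norm_len=200)
```

Averaging this score over all instances gives a dataset-level figure; restricting it to one landmark index gives the per-keypoint breakdown.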

Proposed Model: CPM-HRNet Hybrid

To address the diversity and complexity of monkey poses and species-specific anatomical variations, we proposed a CPM-HRNet hybrid architecture:

Replaced the initial CPM feature extractor with HRNet32 or HRNet48, retaining high-resolution contextual features.
Replaced later convolutional blocks in CPM with a custom “small-HRNet” module to increase model capacity without excessive compute.
Integrated intermediate supervision at each stage to guide heatmap prediction and gradient flow.
Used Gaussian-distributed ground truth heatmaps and summed loss across all stages for backpropagation.
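
The last two points can be sketched as follows: a ground-truth heatmap is a 2D Gaussian centered on the keypoint, and the training loss adds up one term per stage. Mean squared error as the per-stage loss and the 56×56 heatmap resolution are assumptions for illustration; the write-up specifies only sigma = 3 and that the stage losses are summed.

```python
import numpy as np


def gaussian_heatmap(cx, cy, size, sigma=3.0):
    """Return a (size, size) map with a unit-peak Gaussian centered on (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def summed_stage_loss(stage_preds, target):
    """Sum the per-stage mean squared errors against one shared target.

    MSE is an assumed choice of per-stage loss.
    """
    return sum(float(np.mean((pred - target) ** 2)) for pred in stage_preds)


# Toy example: one keypoint and three stages whose outputs get closer to the target.
target = gaussian_heatmap(cx=20, cy=12, size=56)
stage_preds = [target * 0.5, target * 0.8, target]
loss = summed_stage_loss(stage_preds, target)
```

Because every stage contributes its own term, early stages receive a direct gradient signal instead of relying only on gradients propagated back from the final output.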

Training & Augmentation

Training used the Adam optimizer with learning rates of 1e-4 (CPM) and 1e-3 (HRNet and hybrid) for 30 epochs, with an image size of 224×224, a batch size of 32, and sigma = 3 for the ground-truth heatmaps. To combat overfitting and improve generalization:

Applied four data augmentations: random flip, rotation (±15°), scale (0.75–1.25×), and HSV shift.
Observed a significant boost in validation accuracy after adding augmentation.
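
Geometric augmentations have to be applied to the keypoint labels as well as to the pixels. The sketch below samples flip, rotation, and scale parameters in the ranges listed above and applies them to keypoints about the image center. The function names, the 50% flip probability, and the order of operations are assumptions; the HSV shift is omitted because it changes pixels only.

```python
import math
import random


def sample_params(rng):
    """Draw one set of augmentation parameters (flip probability is assumed)."""
    return {
        "flip": rng.random() < 0.5,
        "angle_deg": rng.uniform(-15.0, 15.0),
        "scale": rng.uniform(0.75, 1.25),
    }


def transform_keypoints(keypoints, params, size=224):
    """Flip, then scale and rotate (x, y) keypoints about the image center."""
    c = (size - 1) / 2.0
    theta = math.radians(params["angle_deg"])
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    s = params["scale"]
    out = []
    for x, y in keypoints:
        if params["flip"]:
            x = (size - 1) - x  # mirror across the vertical center line
        dx, dy = (x - c) * s, (y - c) * s
        out.append((c + cos_t * dx - sin_t * dy, c + sin_t * dx + cos_t * dy))
    return out


rng = random.Random(0)
params = sample_params(rng)
augmented = transform_keypoints([(10.0, 20.0), (200.0, 180.0)], params)
```

After a horizontal flip, left and right landmark labels (for example, left and right hands) also need to be swapped; that step is left out here because the landmark ordering is not given in this write-up.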

Results

Model	PCK@0.05 (Total)
CPM	62.16%
HRNet-32	73.77%
HRNet-48	77.90%
Ours (HRNet-32)	78.44%
Ours (HRNet-48)	79.85%

Tail and hip keypoints remained challenging across models due to visual ambiguity and inter-species variation.
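The hybrid model outperformed all baselines across nearly all keypoints, raising total PCK@0.05 by 4.67 points over HRNet-32 and by 1.95 points over HRNet-48 with the matching backbone, and exhibited stronger robustness to occlusion and anatomical diversity.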