Primate Pose Estimation: CPM + HRNet Hybrid Architecture

This project tackled the OpenMonkeyChallenge, a benchmark for estimating 17 anatomical keypoints from single images of monkeys across many species. We developed and benchmarked a deep learning pipeline for markerless pose estimation using Convolutional Pose Machines (CPMs), High-Resolution Networks (HRNet), and a novel hybrid CPM-HRNet model.

Problem Scope and Dataset

The task involved single-instance keypoint detection on a dataset of roughly 112,000 images covering 26 non-human primate species. Each monkey was annotated with 17 anatomical landmarks. The dataset was split into 66,917 training images and 22,306 images each for validation and testing. We used a top-down approach: each monkey instance was cropped using its bounding box before keypoint prediction.
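The coordinate bookkeeping behind the top-down step can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the (x, y, w, h) box format, and the square 224-pixel output are assumptions made for the example.

```python
def keypoints_to_crop(keypoints, box, out_size=224):
    """Map (x, y) keypoints from image coordinates into a resized crop.

    `box` is assumed to be (x, y, w, h) in pixels, and the crop is assumed
    to be resized to an out_size x out_size network input.
    """
    bx, by, bw, bh = box
    sx, sy = out_size / bw, out_size / bh
    return [((x - bx) * sx, (y - by) * sy) for x, y in keypoints]


def keypoints_to_image(keypoints, box, out_size=224):
    """Inverse mapping: crop-space predictions back to image coordinates."""
    bx, by, bw, bh = box
    sx, sy = bw / out_size, bh / out_size
    return [(x * sx + bx, y * sy + by) for x, y in keypoints]
```

Ground-truth keypoints go through the first function before training; predictions go through the second before they are compared against annotations in the original image.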

Baseline Models: CPM and HRNet

  • CPMs use a staged architecture in which each stage takes the heatmaps produced by the previous stage and refines them. Intermediate supervision at every stage improves convergence and helps handle occlusions and visual ambiguities.

  • HRNet, a state-of-the-art backbone, maintains high-resolution feature maps throughout, enabling accurate spatial localization of keypoints.

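We trained both models from scratch, tuned their hyperparameters, and ran inference via multi-stage heatmap regression. The evaluation metric was PCK@0.05 (Probability of Correct Keypoint): a predicted keypoint counts as correct when it lies within 0.05 of a normalizing length, typically the bounding-box size, from the ground truth.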
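For concreteness, here is a minimal sketch of how a PCK score can be computed for one instance. It assumes the distance threshold is normalized by the bounding-box width; the function name and argument layout are illustrative, and the official challenge evaluation may differ in detail.

```python
import math


def pck(pred, gt, box_width, threshold=0.05):
    """Fraction of keypoints predicted within threshold * box_width of the truth.

    `pred` and `gt` are equal-length lists of (x, y) tuples in the same
    coordinate frame. Normalizing by box width is an assumption here.
    """
    limit = threshold * box_width
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= limit
    )
    return hits / len(gt)
```

A dataset-level score would then average this quantity over all instances, or pool the per-keypoint hits before dividing.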

Proposed Model: CPM-HRNet Hybrid

To address the diversity and complexity of monkey poses and species-specific anatomical variations, we proposed a CPM-HRNet hybrid architecture:

  • Replaced the initial CPM feature extractor with HRNet32 or HRNet48, retaining high-resolution contextual features.

  • Replaced later convolutional blocks in CPM with a custom “small-HRNet” module to increase model capacity without excessive compute.

  • Integrated intermediate supervision at each stage to guide heatmap prediction and gradient flow.

  • Used Gaussian-distributed ground truth heatmaps and summed loss across all stages for backpropagation.
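The last two points can be illustrated with a short NumPy sketch. It builds a Gaussian target heatmap for one keypoint and sums a per-stage loss over all stage outputs. The heatmap resolution and the use of mean squared error are assumptions for the example (the text only states that the loss is summed across stages), and the function names are hypothetical.

```python
import numpy as np


def gaussian_heatmap(cx, cy, size=224, sigma=3.0):
    """Return a (size, size) array with a Gaussian peak of height 1 at (cx, cy).

    The array is indexed as [y, x]. `size` is an assumed heatmap resolution.
    """
    xs = np.arange(size, dtype=np.float32)   # shape (size,)
    ys = xs[:, None]                         # shape (size, 1), broadcasts to a grid
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def multi_stage_loss(stage_outputs, target):
    """Sum of per-stage mean squared errors (assumed loss) against one target."""
    return float(sum(np.mean((out - target) ** 2) for out in stage_outputs))
```

Because every stage contributes its own term, gradients reach the early stages directly rather than only through the final prediction, which is the purpose of intermediate supervision.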

Training & Augmentation

Training used the Adam optimizer with learning rates of 1e-4 (CPM) and 1e-3 (HRNet and hybrid) for 30 epochs, with 224×224 input images, a batch size of 32, and sigma = 3 for the ground-truth Gaussian heatmaps. To combat overfitting and improve generalization:

  • Applied four data augmentations: random flip, rotation (±15°), scale (0.75–1.25×), and HSV shift.

  • Observed a significant boost in validation accuracy after adding augmentation.
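Geometric augmentations have to be applied to the keypoints as well as the pixels. The sketch below shows the keypoint side only, using the rotation and scale ranges stated above. The left/right index pairs and the 0.5 flip probability are hypothetical placeholders, not values taken from the dataset or the project.

```python
import math
import random

# Hypothetical left/right keypoint index pairs; the real pairs depend on
# the dataset's keypoint ordering.
FLIP_PAIRS = [(0, 1), (2, 3)]


def flip_keypoints(keypoints, width, pairs=FLIP_PAIRS):
    """Mirror keypoints horizontally and swap left/right labels."""
    flipped = [(width - 1 - x, y) for x, y in keypoints]
    for a, b in pairs:
        flipped[a], flipped[b] = flipped[b], flipped[a]
    return flipped


def rotate_scale_keypoints(keypoints, angle_deg, scale, center):
    """Rotate by angle_deg and scale by `scale` about `center`."""
    cx, cy = center
    cos_a = math.cos(math.radians(angle_deg)) * scale
    sin_a = math.sin(math.radians(angle_deg)) * scale
    return [
        (cx + cos_a * (x - cx) - sin_a * (y - cy),
         cy + sin_a * (x - cx) + cos_a * (y - cy))
        for x, y in keypoints
    ]


def sample_params(rng=random):
    """Draw one set of augmentation parameters."""
    return {
        "flip": rng.random() < 0.5,            # assumed flip probability
        "angle_deg": rng.uniform(-15.0, 15.0),
        "scale": rng.uniform(0.75, 1.25),
    }
```

The image must be warped with the same flip, angle, and scale so that pixels and labels stay aligned. The HSV shift changes only pixel values and leaves keypoints untouched.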

Results

Model             PCK@0.05 (Total)
CPM               62.16%
HRNet-32          73.77%
HRNet-48          77.90%
Ours (HRNet32)    78.44%
Ours (HRNet48)    79.85%

  • Tail and hip keypoints remained challenging across models due to visual ambiguity and inter-species variation.

  • The hybrid model outperformed all baselines across nearly all keypoints and exhibited stronger robustness to occlusion and anatomical diversity.