Depths-CPP

Header-only C++ monocular depth estimation — Depth Anything v2 via ONNX Runtime, real-time on CPU and GPU

Creator & Maintainer · 2024 · active · GitHub ↗

Problem

Stereo cameras and LiDAR provide geometric depth but add cost, weight, and calibration complexity to robot platforms. Monocular depth estimation offers a software-only alternative — one RGB camera, no additional hardware — but the best models (Depth Anything v2) ship as Python packages. Deploying them on embedded robot computers without a Python runtime requires an ONNX export and a C++ inference wrapper that correctly replicates the model’s normalisation pipeline.

Approach

Single-header design following the same pattern as YOLOs-CPP. One header file handles ONNX session setup, pre-processing, inference, and depth map output. Construct with an ONNX model path and a GPU flag, call predict(frame), receive a floating-point depth map as a cv::Mat. The output is ready to pass directly to obstacle avoidance or point-cloud generation code.

Both relative (normalised) and metric depth modes are supported using the appropriate model variant. Colour-mapped visualisation (COLORMAP_INFERNO) is a one-line call. Supports image, video, and live camera inference modes. Multi-threaded architecture with adaptive batch size for throughput-oriented workloads.

Architecture

Input frame → resize to 384×384 → normalise to [0,1] and standardise (ImageNet mean/std) → ONNX inference (CPU, CUDA, or TensorRT execution provider) → H×W float32 depth map → optional COLORMAP_INFERNO visualisation.

Session is created once at construction; inference is stateless. The same binary selects the execution provider at runtime — no recompilation for CPU vs GPU. Dynamic input shape handling accommodates varying resolutions.

Results

Model zoo (all at 384×384 input):

Model                      Type                   Notes
vits.onnx                  FP32, relative depth   ViT-Small, general use
vits_quint8.onnx           UINT8 quantised        Edge-optimised, lower memory
vits_metric_indoor.onnx    FP32, metric depth     Calibrated for indoor scenes
vits_metric_outdoor.onnx   FP32, metric depth     Calibrated for outdoor scenes
  • TensorRT, CUDA, and CPU execution providers supported in the same binary
  • Runs on Linux, macOS, and Windows; cross-platform CMake build
  • 112 stars on GitHub

Lessons

Normalisation conventions differ between model families and are not documented consistently in the Depth Anything v2 release. The PyTorch export uses a specific mean/std pair that differs from the standard ImageNet values used by the ViT backbone. Testing the C++ pre-processing against the Python reference implementation frame-by-frame — comparing depth map values numerically, not just visually — was the only reliable verification method.
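That frame-by-frame check boils down to comparing two float depth buffers numerically. A minimal sketch, with an illustrative tolerance rather than one taken from the project:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Maximum absolute element-wise difference between two depth maps,
// e.g. the C++ output and the Python reference for the same frame.
float maxAbsDiff(const std::vector<float>& a, const std::vector<float>& b) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}

// A pre-processing bug typically produces a large, structured error,
// while matching pipelines agree to within float rounding; the 1e-4
// tolerance here is illustrative.
bool depthMatches(const std::vector<float>& a, const std::vector<float>& b,
                  float tol = 1e-4f) {
    return a.size() == b.size() && maxAbsDiff(a, b) <= tol;
}
```

Comparing maps this way catches errors that a colour-mapped side-by-side view hides, since COLORMAP_INFERNO renormalises each frame.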

Metric depth models require the correct indoor/outdoor variant for the deployment environment. Using an outdoor model indoors (or vice versa) produces plausible-looking but numerically wrong depth values — a failure mode that visual inspection alone will not catch.
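One cheap automated guard is a range sanity check on the metric output. The threshold below is an illustrative heuristic, not a value from the project: indoor scenes rarely have a median depth beyond a few metres, so a much larger median suggests the wrong variant was deployed:

```cpp
#include <algorithm>
#include <vector>

// Median of a metric depth map (values in metres); takes a copy
// because nth_element reorders its input.
float medianDepth(std::vector<float> depth) {
    auto mid = depth.begin() + depth.size() / 2;
    std::nth_element(depth.begin(), mid, depth.end());
    return *mid;
}

// Illustrative heuristic (threshold is an assumption, not from the
// project): flag frames whose median metric depth is implausibly
// large for an indoor deployment.
bool plausibleIndoor(const std::vector<float>& depth) {
    return !depth.empty() && medianDepth(depth) < 10.0f;
}
```

A check like this turns the silent indoor/outdoor mismatch into a loggable event instead of relying on visual inspection.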

Stack

  • C++17
  • ONNX Runtime 1.16–1.19
  • OpenCV 4.5+
  • CUDA Toolkit 11.0+
  • CMake 3.14+

Technologies

  • C++17
  • ONNX Runtime
  • OpenCV
  • Depth Anything v2
  • CUDA