YOLOs-CPP
Header-only C++ library for real-time YOLO inference — detection, segmentation, pose, OBB — no Python, no runtime bloat
Problem
Running YOLO in robotics and embedded systems typically means a Python runtime, a subprocess boundary, and latency you can’t budget. The ONNX ecosystem promised cross-platform inference, but the reference implementations were Python-first. Teams either lived with the Python overhead or rewrote from scratch every time a new YOLO version dropped.
The deeper problem: YOLO versions v5 through v12 have incompatible output formats. Each update broke existing C++ wrappers. Projects using detection today and adding segmentation tomorrow had to touch two separate codebases.
Approach
Single-header design, one file per task type. Drop yolov8_det.hpp into any CMake project, link ONNX Runtime and OpenCV, and you have a working detector in under fifty lines. No framework lock-in, no package manager step.
The API surface is deliberately narrow: construct with a model path and confidence threshold, call detect(frame), iterate results. The same pattern applies across detection, segmentation, oriented bounding boxes, and pose estimation — switching task types is a one-line change.
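The pattern above can be sketched as follows. The header name and the construct/detect/iterate flow come from this project's description; the class name (`YOLODetector`), constructor signature, and result fields (`box`, `conf`) are assumptions for illustration, not the library's verified API:

```cpp
// Sketch of the usage pattern: construct once, call detect() per frame,
// iterate results. Class and field names are hypothetical.
#include <opencv2/opencv.hpp>
#include "yolov8_det.hpp"

int main() {
    YOLODetector detector("yolov8n.onnx", /*conf_threshold=*/0.25f);
    cv::VideoCapture cap(0);                 // live camera feed
    cv::Mat frame;
    while (cap.read(frame)) {
        for (const auto& det : detector.detect(frame))
            cv::rectangle(frame, det.box, {0, 255, 0}, 2);
        cv::imshow("YOLOs-CPP", frame);
        if (cv::waitKey(1) == 27) break;     // Esc to quit
    }
}
```

Swapping in the segmentation or pose header would follow the same shape, changing only the include and the class used.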
A model-agnostic parser absorbs the differences between YOLO output formats internally. Adding v12 support required touching only the parser, not any caller code.
Architecture
Each header encapsulates: ONNX session initialisation, pre-processing (resize, normalise, NCHW conversion), inference, and post-processing (NMS, coordinate rescaling). GPU execution paths use the ONNX Runtime CUDA execution provider when available; the same binary falls back to CPU without recompilation.
Quantized models (INT8/FP16) load identically to FP32 — no code changes needed. Sample pipelines cover image files, video streams, and live camera feeds via OpenCV VideoCapture. 36 automated tests gate each release.
Results
Measured on Intel i7-12700H / RTX 3060, 640×640 input, YOLOv11n model:
| Backend | FPS | Latency (ms) | Memory (MB) |
|---|---|---|---|
| CPU | 15 | 67 | 48 |
| CUDA (GPU) | 97 | 10 | 412 |
Additional GPU benchmarks (RTX 3060, 640×640):
| Model | FPS |
|---|---|
| YOLOv8n | 86 |
| YOLO26n | 78 |
| YOLOv11n-seg | 65 |
| YOLOv11n-pose | 80 |
- Supports YOLO v5, v6, v7, v8, v9, v10, v11, v12 in detection, segmentation, OBB, pose, and classification modes
- Zero Python in the inference path — deterministic latency on embedded hardware
- 968 stars on GitHub; used in production robotics perception and industrial inspection
Lessons
Post-processing is where version differences live. YOLO v8 switched from anchor-based to anchor-free heads; v10 added NMS-free variants. Keeping the pre/post-processing logic inside each header, rather than in a shared base class, made these changes easier to isolate and test without regressions.
Header-only simplicity has limits: compile times grow with template depth. Future work: a thin compiled core with the header as a lightweight adaptor.