YOLOs-CPP
Header-only C++ library for real-time YOLO inference — detection, segmentation, pose, OBB — no Python, no runtime bloat
Problem
Running YOLO in robotics and embedded systems typically means a Python runtime, a subprocess boundary, and latency you can’t budget. The ONNX ecosystem promised cross-platform inference, but the reference implementations were Python-first. Teams either lived with the Python overhead or rewrote from scratch every time a new YOLO version dropped.
The deeper problem: YOLO versions v5 through v12 have incompatible output formats. Each update broke existing C++ wrappers. Projects using detection today and adding segmentation tomorrow had to touch two separate codebases.
Approach
Single-header design, one file per task type. Drop yolov8_det.hpp into any CMake project, link ONNX Runtime and OpenCV, and you have a working detector in under fifty lines. No framework lock-in, no package manager step.
The API surface is deliberately narrow: construct with a model path and confidence threshold, call detect(frame), iterate results. The same pattern applies across detection, segmentation, oriented bounding boxes, and pose estimation — switching task types is a one-line change.
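The pattern above can be sketched as follows. The header name and the construct/detect/iterate flow come from this project's description; the class name (`YOLODetector`), constructor signature, and result fields (`box`, `conf`) are assumptions for illustration, not the library's verified API:

```cpp
// Sketch of the usage pattern: construct once, call detect() per frame,
// iterate results. Class and field names are hypothetical.
#include <opencv2/opencv.hpp>
#include "yolov8_det.hpp"

int main() {
    YOLODetector detector("yolov8n.onnx", /*conf_threshold=*/0.25f);
    cv::VideoCapture cap(0);                 // live camera feed
    cv::Mat frame;
    while (cap.read(frame)) {
        for (const auto& det : detector.detect(frame))
            cv::rectangle(frame, det.box, {0, 255, 0}, 2);
        cv::imshow("YOLOs-CPP", frame);
        if (cv::waitKey(1) == 27) break;     // Esc to quit
    }
}
```

Swapping in the segmentation or pose header would follow the same shape, changing only the include and the class used.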
A model-agnostic parser absorbs the differences between YOLO output formats internally. Adding v12 support required touching only the parser, not any caller code.
Architecture
Each header encapsulates: ONNX session initialisation, pre-processing (resize, normalise, NCHW conversion), inference, and post-processing (NMS, coordinate rescaling). GPU execution paths use the ONNX Runtime CUDA execution provider when available; the same binary falls back to CPU without recompilation.
Quantized models (INT8/FP16) load identically to FP32 — no code changes needed. Sample pipelines cover image files, video streams, and live camera feeds via OpenCV VideoCapture. 36 automated tests gate each release.
Results
Measured on Intel i7-12700H / RTX 3060, 640×640 input, YOLOv11n model:
| Backend | FPS | Latency (ms) | Memory (MB) |
|---|---|---|---|
| CPU | 15 | 67 | 48 |
| CUDA (GPU) | 97 | 10 | 412 |
Additional GPU benchmarks (RTX 3060, 640×640):
| Model | FPS |
|---|---|
| YOLOv8n | 86 |
| YOLO26n | 78 |
| YOLOv11n-seg | 65 |
| YOLOv11n-pose | 80 |
- Supports YOLO v5, v6, v7, v8, v9, v10, v11, v12 in detection, segmentation, OBB, pose, and classification modes
- Zero Python in the inference path — deterministic latency on embedded hardware
- 968 stars on GitHub; used in production robotics perception and industrial inspection
Lessons
Post-processing is where version differences live. YOLO v8 switched from anchor-based to anchor-free heads; v10 added NMS-free variants. Keeping the pre/post-processing logic inside each header, rather than in a shared base class, made these changes easier to isolate and test without regressions.
Header-only simplicity has limits: compile times grow with template depth. Future work: a thin compiled core with the header as a lightweight adaptor.