
YOLOs-CPP-TensorRT

Header-only C++ YOLO library for NVIDIA TensorRT — GPU preprocessing, CUDA Graph replay, sub-2ms latency, 530+ FPS

Creator & Maintainer · 2026 · active · GitHub ↗

Problem

Most YOLO C++ wrappers treat preprocessing as an afterthought — resizing on the CPU, copying synchronously, rebuilding TensorRT launch parameters every frame. On a laptop GPU this costs 1–3ms per frame before inference even begins. When the model itself runs in under 2ms, a CPU preprocessing bottleneck doubles end-to-end latency.

The other gap: TensorRT engines are model-specific, and each YOLO version lays out its output tensors differently, so wrappers need manual per-version configuration. Each new YOLO release breaks existing wrappers.

Approach

YOLOs-CPP-TensorRT was built around one principle: the GPU should never wait for the CPU. Every stage of the pipeline that can move to the GPU does — preprocessing runs as a single CUDA kernel, host-to-device transfer uses pinned memory for async overlap, and the entire inference graph is captured once and replayed via cudaGraphLaunch.

Model version auto-detection reads output tensor shapes at engine load time — no manual --model-version flag required. FP32, FP16, and INT8 engines load identically through the same API.
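The auto-detection idea can be sketched as a match on output tensor rank and shape. The specific shape patterns below (e.g. `[1, 84, 8400]` for detection) are illustrative assumptions about common YOLO exports, not the library's actual matching rules:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: classify the task from an engine's output shapes.
// The patterns are assumptions, not the library's real detection table.
enum class Task { Detection, Segmentation, Pose, Classification, Unknown };

Task detectTask(const std::vector<std::vector<int64_t>>& outputShapes) {
    if (outputShapes.size() == 2) {
        // e.g. [1, 116, 8400] detections + [1, 32, 160, 160] mask prototypes
        return Task::Segmentation;
    }
    const auto& s = outputShapes.front();
    if (s.size() == 2) {
        // e.g. [1, 1000] class scores
        return Task::Classification;
    }
    if (s.size() == 3) {
        // e.g. [1, 84, 8400]: 4 box coords + 80 classes  -> detection
        //      [1, 56, 8400]: 4 + 1 conf + 17 kpts * 3   -> pose
        return (s[1] == 56) ? Task::Pose : Task::Detection;
    }
    return Task::Unknown;
}
```

Because the decision is made at engine load time, the same call site handles every supported model family without a version flag.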

Architecture

GPU letterbox + normalize (single CUDA kernel) → async cudaMemcpyAsync to pinned buffer → cudaGraphLaunch (CUDA Graph replay of enqueueV3) → NMS post-processing → structured output.

The CUDA kernel performs bilinear letterbox resize, BGR→RGB conversion, and /255.0 normalisation in one pass, writing directly into the TRT input buffer. One cudaStream_t drives the full preprocess → infer → postprocess pipeline with minimal sync points.
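The geometry that kernel evaluates per output pixel can be sketched on the host. This is a minimal illustration of the letterbox math (uniform scale plus symmetric padding into the network input); the struct and function names are hypothetical:

```cpp
#include <algorithm>

// Illustrative host-side letterbox geometry: one uniform scale factor and
// symmetric padding, as a fused GPU kernel would apply per output pixel.
// Names are hypothetical, not the library's API.
struct Letterbox {
    float scale;  // source -> network scale factor
    int   padX;   // left padding in network pixels
    int   padY;   // top padding in network pixels
};

Letterbox computeLetterbox(int srcW, int srcH, int netW, int netH) {
    float scale = std::min(static_cast<float>(netW) / srcW,
                           static_cast<float>(netH) / srcH);
    int scaledW = static_cast<int>(srcW * scale);
    int scaledH = static_cast<int>(srcH * scale);
    return {scale, (netW - scaledW) / 2, (netH - scaledH) / 2};
}
```

Inside the kernel, each output pixel (x, y) maps back to source coordinates ((x − padX) / scale, (y − padY) / scale), samples bilinearly, swaps BGR to RGB, and multiplies by 1/255 before writing into the planar input buffer.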

CUDA Graph capture runs once at model load; fixed-shape engines replay with ~0.1–0.3ms less dispatch overhead per frame than bare enqueueV3.
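The capture-once, replay-per-frame pattern looks roughly like this. It is a sketch, not the library's actual code: `context` stands for a TensorRT `IExecutionContext` with its I/O tensor addresses already set, and error handling is omitted:

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Sketch: record one enqueueV3 into a CUDA graph at load time.
// Error checks omitted for brevity.
cudaGraphExec_t captureInference(nvinfer1::IExecutionContext& context,
                                 cudaStream_t stream) {
    cudaGraph_t graph = nullptr;
    cudaGraphExec_t graphExec = nullptr;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context.enqueueV3(stream);            // recorded, not executed
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, 0);
    cudaGraphDestroy(graph);              // the instantiated exec survives
    return graphExec;
}

// Per frame, a single call replays the whole recorded inference:
// cudaGraphLaunch(graphExec, stream);
```

Replaying the instantiated graph skips per-kernel launch setup, which is where the ~0.1–0.3 ms of saved dispatch overhead comes from.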

Results

Measured on NVIDIA RTX 2000 Ada (Laptop) — YOLOv11n · 640×640 · 1000 iterations · 10-iter warm-up:

| Precision | FPS | Avg latency | P50     | P99     | GPU memory |
|-----------|-----|-------------|---------|---------|------------|
| FP32      | 466 | 2.14 ms     | 2.04 ms | 3.03 ms | 530 MB     |
| FP16      | 479 | 2.09 ms     | 1.98 ms | 2.91 ms | 536 MB     |
| INT8      | 530 | 1.89 ms     | 1.78 ms | 2.70 ms | 444 MB     |

Numbers include the full pipeline — GPU preprocessing, inference, and post-processing — and throughput scales roughly with GPU compute on higher-end hardware.

Supported tasks (auto-detected from tensor shape):

| Task           | YOLO versions                                            |
|----------------|----------------------------------------------------------|
| Detection      | v5 · v7 · v8 · v9 · v10 · v11 · v12 · v26 · NAS          |
| Segmentation   | v8-seg · v11-seg · v26-seg                               |
| Pose           | v8-pose · v11-pose · v26-pose                            |
| OBB            | v8-obb · v11-obb · v26-obb                               |
| Classification | v8-cls · v11-cls · v12-cls · v26-cls                     |

  • 54 stars on GitHub
  • Jetson Xavier/Orin compatible (CC ≥ 7.5)

Lessons

CUDA Graph capture is a significant win but has a hard constraint: it only works with fixed input shapes. The first time a caller changes the input resolution, the graph must be re-captured. Building a small graph cache keyed on (height, width) avoids repeated capture costs for multi-resolution workloads.
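A minimal sketch of such a cache, with the capture step stubbed as a callable so the shape-keyed lookup logic stands alone. The class and its names are hypothetical; a real build would store `cudaGraphExec_t` as the handle type:

```cpp
#include <functional>
#include <map>
#include <utility>

// Hypothetical shape-keyed graph cache: the capture callable runs once per
// (height, width), and later lookups return the stored handle. The handle
// type is templated so a real build could hold cudaGraphExec_t.
template <typename GraphHandle>
class GraphCache {
public:
    using Capture = std::function<GraphHandle(int h, int w)>;
    explicit GraphCache(Capture capture) : capture_(std::move(capture)) {}

    GraphHandle& get(int h, int w) {
        auto key = std::make_pair(h, w);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            it = cache_.emplace(key, capture_(h, w)).first;  // capture once
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    Capture capture_;
    std::map<std::pair<int, int>, GraphHandle> cache_;
};
```

With this shape, a workload alternating between two resolutions pays the capture cost exactly twice, then replays from the cache on every subsequent frame.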

TensorRT INT8 calibration quality determines whether INT8 saves memory without accuracy loss. Using a representative calibration dataset from the actual deployment domain (not ImageNet defaults) is the difference between a useful INT8 engine and one that misses detections.

Stack

  • C++17
  • TensorRT ≥10.0
  • CUDA ≥12.0
  • OpenCV 4.5+
  • CMake 3.18+
