SmolVLM2-ROS2
On-device Vision-Language Model for robotics — SmolVLM2 running via ONNX Runtime inside a ROS2 node for scene understanding and spatial reasoning
Problem
Large Vision-Language Models can answer open-ended questions about scenes — “is the pallet bay on the left clear?”, “what is the person doing?” — but they require cloud inference or powerful GPUs. Robots operating in warehouses or hospitals need scene understanding that is latency-bounded, privacy-preserving, and functional without network access.
SmolVLM2 hits the right size point: capable enough for spatial reasoning and scene description, yet small enough to run on edge hardware. The missing piece was a production-ready ROS2 integration exposing VLM inference as a standard service that other nodes can call.
Approach
Export SmolVLM2 to ONNX format and wrap it in a ROS2 node that accepts image messages and text prompts, returning natural-language responses via a ROS2 service. The service interface lets navigation, manipulation, and decision-making nodes query the VLM synchronously when they need scene context.
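A minimal sketch of that service wrapper in rclpy. The `VlmQuery` service type, its field names, and the `smolvlm2_interfaces` package are hypothetical stand-ins for the package's actual interface:

```python
import rclpy
from rclpy.node import Node
from smolvlm2_interfaces.srv import VlmQuery  # hypothetical: image + prompt in, answer out

class SmolVlm2Node(Node):
    def __init__(self):
        super().__init__('smolvlm2_node')
        # Synchronous query endpoint that other nodes call when they need scene context
        self.create_service(VlmQuery, 'vlm_query', self.on_query)

    def on_query(self, request, response):
        # Stateless per call: preprocess -> encode -> decode -> reply
        response.answer = self.infer(request.image, request.prompt)
        return response

    def infer(self, image_msg, prompt: str) -> str:
        raise NotImplementedError  # see the pipeline sketch under Architecture

def main():
    rclpy.init()
    rclpy.spin(SmolVlm2Node())

if __name__ == '__main__':
    main()
```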
ONNX Runtime handles cross-platform inference; the same package runs on x86 development machines and ARM robot computers without model conversion. Inference runs entirely offline — no cloud API calls.
Architecture
Image topic + text prompt (service request) → image pre-processing (resize, normalise to model input spec) → vision encoder (ONNX) → language model decoder (ONNX, greedy or beam search) → text response (service reply).
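A sketch of that pipeline with ONNX Runtime. Model filenames, tensor names, the 384-pixel input size, and the mean/std values are illustrative assumptions, not the package's actual spec:

```python
import cv2
import numpy as np
import onnxruntime as ort

# CPUExecutionProvider is the portable default; the same sessions load
# unchanged on x86 and ARM builds of ONNX Runtime.
encoder = ort.InferenceSession('vision_encoder.onnx', providers=['CPUExecutionProvider'])
decoder = ort.InferenceSession('decoder.onnx', providers=['CPUExecutionProvider'])

def preprocess(image_bgr: np.ndarray, size: int = 384) -> np.ndarray:
    """Resize and normalise to the model input spec (assumed NCHW float32)."""
    img = cv2.resize(image_bgr, (size, size)).astype(np.float32) / 255.0
    img = (img - 0.5) / 0.5                      # assumed mean/std of 0.5
    return img.transpose(2, 0, 1)[None]          # HWC -> 1xCxHxW

def generate(image_bgr: np.ndarray, prompt_ids: list[int],
             max_tokens: int = 64, eos_id: int = 2) -> list[int]:
    """Greedy decoding: encode the image once, then extend the token
    sequence one argmax step at a time until EOS or the token budget."""
    feats = encoder.run(None, {'pixel_values': preprocess(image_bgr)})[0]
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = decoder.run(None, {
            'input_ids': np.array([ids], dtype=np.int64),
            'encoder_hidden_states': feats,
        })[0]
        next_id = int(logits[0, -1].argmax())    # greedy pick
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids[len(prompt_ids):]                 # new tokens only
```

For clarity this sketch re-runs the full token sequence at every step; the decoder-with-past split discussed under Lessons avoids that by caching attention keys and values.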
Token generation parameters (max tokens, temperature, beam width) are ROS2 parameters set at launch. The vision encoder and decoder sessions are created once at node activation; inference is stateless between calls. Compatible with ROS2 Humble and Iron; lifecycle-managed for clean start/stop in multi-node compositions.
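A sketch of the lifecycle wiring using rclpy's lifecycle node (available in Humble and Iron). The exact parameter names and model filenames are assumptions; only the parameter set itself (max tokens, temperature, beam width) comes from the design above:

```python
from rclpy.lifecycle import Node, State, TransitionCallbackReturn
import onnxruntime as ort

class SmolVlm2Lifecycle(Node):
    def __init__(self):
        super().__init__('smolvlm2_node')
        # Generation knobs are ROS2 parameters, set once at launch
        self.declare_parameter('max_tokens', 64)
        self.declare_parameter('temperature', 0.0)
        self.declare_parameter('beam_width', 1)
        self.encoder = None
        self.decoder = None

    def on_activate(self, state: State) -> TransitionCallbackReturn:
        # Sessions are created once at activation; inference stays stateless
        self.encoder = ort.InferenceSession('vision_encoder.onnx')  # assumed filename
        self.decoder = ort.InferenceSession('decoder.onnx')         # assumed filename
        return super().on_activate(state)

    def on_deactivate(self, state: State) -> TransitionCallbackReturn:
        # Release sessions for a clean stop inside multi-node compositions
        self.encoder = self.decoder = None
        return super().on_deactivate(state)
```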
Results
- Answers spatial and semantic questions about live camera frames using a fully on-device VLM
- No cloud dependency — privacy-preserving for hospital and industrial deployment environments
- Service interface integrates directly with nav2 and manipulation pipelines as a scene-understanding primitive
- ONNX export runs on both x86 and ARM targets without re-conversion
Lessons
VLM output is non-deterministic by design, which conflicts with the robotics expectation of deterministic sensor outputs. Wrapping the VLM in a service (request/reply) rather than a topic (fire-and-forget) made it straightforward to add timeouts and retry logic at the caller, keeping the VLM node itself simple while giving callers full control over failure handling.
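A sketch of the caller side, reusing the same hypothetical `VlmQuery` interface as above; the deadline and retry policy live entirely in the client:

```python
import rclpy
from rclpy.node import Node
from smolvlm2_interfaces.srv import VlmQuery  # hypothetical interface, as above

def query_vlm(node: Node, image, prompt: str,
              timeout_s: float = 2.0, retries: int = 2) -> str | None:
    """Call the VLM service with a deadline; retry on timeout, give up after N tries."""
    client = node.create_client(VlmQuery, 'vlm_query')
    if not client.wait_for_service(timeout_sec=timeout_s):
        return None
    request = VlmQuery.Request(image=image, prompt=prompt)
    for _ in range(retries + 1):
        future = client.call_async(request)
        rclpy.spin_until_future_complete(node, future, timeout_sec=timeout_s)
        if future.done():
            return future.result().answer
        future.cancel()  # deadline hit: cancel this attempt and retry
    return None          # caller decides what 'no answer' means
```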
ONNX export of encoder-decoder models requires splitting the model at the KV-cache boundary and exporting each component separately. This is not documented clearly in Transformers or Optimum; validating the C++ output against the Python reference frame-by-frame was the only reliable verification method.
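A sketch of that validation under assumed artefact names: the C++ node dumps logits per decode step, and the Python side replays the same saved inputs through the exported decoder and compares elementwise:

```python
import numpy as np
import onnxruntime as ort

# Assumed artefacts: inputs_step00.npz saved by the Python reference, and
# cpp_logits_step00.bin dumped by the C++ node for the same frame and step.
decoder = ort.InferenceSession('decoder.onnx', providers=['CPUExecutionProvider'])
npz = np.load('inputs_step00.npz')
inputs = {name: npz[name] for name in npz.files}

py_logits = decoder.run(None, inputs)[0].ravel()
cpp_logits = np.fromfile('cpp_logits_step00.bin', dtype=np.float32)

# Compare elementwise rather than just the argmax token: small numerical
# drift is fine, but a layout or dtype mismatch shows up immediately here.
assert py_logits.shape == cpp_logits.shape
assert np.allclose(py_logits, cpp_logits, atol=1e-4), \
    f'max abs diff {np.abs(py_logits - cpp_logits).max():.6f}'
```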
Stack