The Raspberry Pi has matured into a viable edge-AI platform, but which models actually fit, how fast do they run, and how do you deploy them reliably in production? This guide focuses specifically on Pi: model selection, honest performance expectations, optimisation techniques, and real deployments from our own shipped products. For the broader question of when to run AI on-device versus in the cloud, see our edge AI vs cloud AI comparison.


The Raspberry Pi 4 brought edge AI within reach for embedded product teams: full Linux, Python, and enough RAM to load small models. The Pi 5 made it practical for production. The BCM2712 SoC packs four ARM Cortex-A76 cores running at 2.4 GHz, paired with up to 8 GB of LPDDR4X RAM. That is not a microcontroller's compute headroom; it is a substantial step up from the Cortex-A72 in the Pi 4, and the difference shows up directly in inference latency.
Raspberry Pi also now has a first-party AI acceleration path. The Hailo-8L (13 TOPS) is available via the Raspberry Pi AI Kit and AI HAT+, and the Hailo-8 (26 TOPS) via the AI HAT+. Both connect over the Pi 5's PCIe interface (the AI Kit pairs an M.2 Hailo-8L module with the M.2 HAT+). These are not required for lightweight models, but they substantially raise the ceiling for higher-throughput computer vision workloads. Per the Raspberry Pi Foundation's published documentation, the AI HAT+ with Hailo-8 enables real-time object detection at resolutions and frame rates that Pi CPU alone cannot sustain.
Pi is not a Jetson and not a workstation. For multi-stream HD object detection or large language model inference, it is not the right platform. But for a wide class of production tasks (computer vision at modest frame rates, face recognition for access control, small speech models, sensor-based anomaly detection), Pi 5 is the right cost/power/capability point: mains-powered, a £60 to £80 board, the full Linux ecosystem, and standard Python libraries. For a wider view of where embedded AI is heading across all platforms, see the broader future of embedded AI, which covers the landscape from microcontrollers through to the Pi class and beyond.
YOLOv8n (nano) is the standard starting point for object detection on Pi. It is the smallest variant in the YOLOv8 family and the one Ultralytics explicitly positions for resource-constrained deployment; the "n" for nano is the signal. On Pi 5 with an optimised runtime (NCNN or TFLite rather than PyTorch), it achieves frame rates suitable for many production applications. YOLOv8s, YOLOv8m, and larger variants are significantly slower; on Pi CPU alone they are not suitable for real-time use. If you are starting a new CV project on Pi, begin with YOLOv8n and only step up to a larger variant if accuracy profiling on your specific dataset demands it.
YOLOv5n remains in use on legacy deployments and is a valid choice if your tooling is already built around it. The same size constraint applies: nano variant only for real-time Pi inference.
MobileNet-SSD and other MobileNet-family classifiers are well-suited to Pi via TFLite. Designed for mobile and edge from the start, they are efficient on ARM and integrate naturally with the Coral USB Accelerator if you want hardware-assisted throughput on TFLite models.
OpenCV with the face_recognition library (dlib-based) is the practical choice for face recognition in authentication tasks: event-driven recognition (door approach, button press) where you process one frame per trigger, rather than continuous high-FPS tracking. On Pi 5, the dlib recognition pipeline is suitable for access-control response times. This is what we use in our FZLockers deployment.
Whisper tiny / base via whisper.cpp is the standard approach for speech-to-text on Pi. whisper.cpp is a C++ port of OpenAI's Whisper that runs significantly faster than the original Python implementation, making the tiny and base model variants practical on Pi 5 hardware. For lighter wake-word detection where Whisper's footprint is excessive, Vosk and Porcupine are alternatives with lower resource requirements and sub-second latency.
Scikit-learn models (random forests, gradient boosting, SVMs) exported to ONNX or serialised with Joblib are extremely fast on Pi: millisecond-class inference for tabular sensor data. Anomaly detection, predictive maintenance classification, and real-time sensor stream processing are all well within Pi's capability without any hardware acceleration. This category covers a large proportion of real production edge AI use cases and is often overlooked in favour of flashier computer vision work.
Small quantised LLMs (TinyLlama 1.1B or Phi-2 at Q4 quantisation via llama.cpp) will run on Pi 5 with 8 GB RAM. They are slow: expect single-digit tokens per second. This is batch-only territory, not interactive. A Pi running a local LLM can serve use cases where response latency is not critical, such as generating a short status report overnight. For anything requiring conversational response times, Pi is not the right inference platform. Run the LLM server-side and use the Pi as the interface and orchestration layer.
Real performance depends on model variant, input resolution, runtime format, and whether a hardware accelerator is present. The table below uses qualitative descriptors and source attributions rather than specific numbers we have not independently measured: the most harmful thing in an engineering guide is a benchmark you cannot reproduce. Where Ultralytics or the Raspberry Pi Foundation have published benchmarks, we note the source and point to their documentation for the specific figures.
| Model / Library | Task | Pi Hardware | Performance Notes |
|---|---|---|---|
| YOLOv8n (NCNN or TFLite) | Object detection | Pi 5 (4 GB) | Near-real-time for many production applications. Per Ultralytics documentation, YOLOv8n is the recommended variant for resource-constrained deployment. Use the NCNN or TFLite runtime, not PyTorch. |
| YOLOv8n (NCNN or TFLite) | Object detection | Pi 4 (4 GB) | Slower than Pi 5; may meet application requirements at reduced resolution. Pi 5 is strongly preferred for CV workloads: the Cortex-A76 vs A72 gap is significant for inference. |
| YOLOv8s / m / l | Object detection | Pi 5 (CPU only) | Not suitable for real-time use on Pi CPU alone. Use YOLOv8n, or add Hailo HAT acceleration if a larger model is required for accuracy. |
| OpenCV + face_recognition (dlib) | Face recognition | Pi 5 | Suitable for event-driven access control at the response times required for door-entry or locker-access use cases. Not designed for continuous high-FPS tracking. Verified in our FZLockers deployment. |
| MobileNet-SSD (TFLite) | Object classification | Pi 4 / Pi 5 | Designed for mobile and edge from the ground up; efficient on ARM. Good baseline for classification tasks. Compatible with Coral USB Accelerator for higher throughput. |
| Whisper tiny (whisper.cpp) | Speech-to-text | Pi 5 | Faster than real-time for short utterances. Practical for command recognition and short-form transcription. whisper.cpp (C++) is substantially faster than Python Whisper for Pi. |
| Whisper base (whisper.cpp) | Speech-to-text | Pi 5 | Near-real-time to slightly above real-time for short-to-medium utterances. Higher accuracy than tiny; still practical for production short-form use cases on Pi 5. |
| scikit-learn / ONNX | Sensor anomaly detection / classification | Pi 4 / Pi 5 | Millisecond-class inference for tabular sensor data. No practical performance concern at any Pi version. Well within CPU capability without acceleration. |
| TinyLlama 1.1B Q4 (llama.cpp) | Text generation (LLM) | Pi 5 (8 GB) | Runs, but slowly: single-digit tokens per second. Batch or offline workloads only; not suitable for interactive chat. Use server-side LLM inference for production interactive use. |
| YOLOv8n + Hailo HAT+ | Object detection | Pi 5 + Hailo-8 (26 TOPS) | Substantially higher throughput than CPU alone. Per Raspberry Pi Foundation AI HAT+ documentation, real-time detection at higher resolutions and frame rates is achievable. See Pi Foundation benchmarks for model-specific figures. |
All performance descriptors ("near-real-time", "faster than real-time") reflect models running at production-relevant input sizes, not minimal test cases. Profile on your specific hardware and input pipeline before finalising platform decisions.
Edge AI on Pi is not theoretical for us. The following are confirmed shipped deployments and capabilities from DigitalMonk's production work.

A Pi 5-based smart locker system deployed at the client's site in the UK. Users approach a locker bank and are authenticated via face recognition, with no card or PIN required. We built the computer vision pipeline using OpenCV and the face_recognition library (dlib-based). The Pi 5 handles recognition within a response window suitable for locker-access use.
Pi was the right platform here: the full software stack (Python, OpenCV, dlib, SQLite for the access log, a local management interface) runs naturally on Linux. The images shown were provided by the client from the deployed site.
Raspberry Pi 5 · OpenCV · face_recognition · Python · Access Control
Converting standard refrigerators into smart vending units. A Pi 5 with a camera watches the fridge interior. When a customer opens the door and removes an item, a lightweight YOLOv8 model running on the Pi identifies the product, triggers the solenoid lock sequence, and debits the customer's account, all within the door-close window.
On-device inference is non-negotiable here: the unlock and debit action must complete within the door interaction window. A cloud round-trip would add unacceptable latency and fail entirely when connectivity drops. Local Pi inference eliminates both risks.
Raspberry Pi 5 · YOLOv8 · Object Detection · Solenoid Control
Deploying an AI model to Pi is a pipeline, not a single step. Getting each stage right, from model selection through runtime choice to the capture pipeline, is what separates a prototype that works on a desk from a system that holds up in production. The diagram below shows the six stages every Pi AI deployment goes through.
You start with a model: either a pre-trained one (YOLOv8n weights, Whisper tiny, an OpenCV classifier) or one trained on your own data using cloud GPU resources. That model almost certainly is not ready to run on Pi in its native format; it needs to be optimised. Quantisation (FP32 to INT8) cuts memory footprint and speeds up inference with a small, usually acceptable accuracy trade-off. Format conversion, from PyTorch or TensorFlow to ONNX and then to TFLite or NCNN, is necessary to reach the most efficient Pi runtime. Once optimised, the model file lands on the Pi alongside the Python or C++ runtime that runs it. The input pipeline (camera, mic, sensors) feeds frames or audio into inference, and inference drives the application action: unlocking, logging, transcribing, dispensing.
Fig. 1: The six-stage AI deployment pipeline on Raspberry Pi. Stage 2 (optimisation) and Stage 5 (runtime format) are the decisions most teams underestimate; they have the largest impact on production inference performance.
These are the techniques that make the most measurable difference in practice, drawn from building Pi AI systems in production.
Converting weights from FP32 to INT8 cuts their memory footprint to roughly a quarter (four bytes per weight become one) and significantly speeds up inference, with a small and usually acceptable accuracy trade-off. For most production CV tasks the accuracy loss is imperceptible; for sensor-data anomaly detection it is typically negligible. Quantisation is not optional on Pi; it is table stakes for any model that needs to run at production throughput.
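To make the trade-off concrete, here is a minimal sketch of affine INT8 quantisation in plain Python. It is illustrative only: the function names and sample weights are hypothetical, and real toolchains such as TFLite or NCNN calibrate ranges on representative data rather than on a single list of values.

```python
def quantise_int8(values):
    """Map floats to INT8 codes using an affine (scale + zero-point) scheme."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # include zero so 0.0 stays exact
    scale = (hi - lo) / 255.0 or 1.0          # guard against an all-zero input
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantise(codes, scale, zero_point):
    """Recover approximate float values from INT8 codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [-0.82, -0.1, 0.0, 0.33, 1.2]
codes, scale, zero_point = quantise_int8(weights)
restored = dequantise(codes, scale, zero_point)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
fp32_bytes = 4 * len(weights)   # 4 bytes per FP32 weight
int8_bytes = 1 * len(weights)   # 1 byte per INT8 weight
print(f"max round-trip error: {max_error:.5f}, "
      f"storage: {fp32_bytes} B -> {int8_bytes} B")
```

The round-trip error is bounded by the quantisation step (`scale`), which is why a tensor with a narrow value range loses very little accuracy.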
PyTorch and TensorFlow are training frameworks, not efficient inference runtimes for ARM. Convert to ONNX first, then to TFLite or NCNN for Pi deployment. NCNN is particularly efficient on ARM CPUs. whisper.cpp, not Python Whisper, is the correct runtime for Whisper models. Each format choice has a measurable impact on inference latency, so benchmark in the format you will deploy rather than in PyTorch.
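For YOLOv8 specifically, the conversion can be driven from the Ultralytics Python package. The sketch below assumes `ultralytics` is installed on the machine doing the export (usually a workstation, not the Pi); the preset names and the `export_for_pi` helper are our own illustration, not part of any library.

```python
# Export settings for the two Pi-friendly runtimes named above.
# The preset keys are hypothetical labels; the values are Ultralytics export arguments.
EXPORT_PRESETS = {
    "ncnn": {"format": "ncnn"},
    "tflite_int8": {"format": "tflite", "int8": True},
}


def export_for_pi(weights="yolov8n.pt", target="ncnn", imgsz=640):
    """Export YOLO weights to a Pi-friendly format and return the output path."""
    if target not in EXPORT_PRESETS:
        raise ValueError(
            f"unknown target {target!r}; choose from {sorted(EXPORT_PRESETS)}"
        )
    # Imported lazily so this module still loads where ultralytics is absent.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.export(imgsz=imgsz, **EXPORT_PRESETS[target])
```

Calling `export_for_pi(target="ncnn")` on the build machine should produce an NCNN model directory to copy to the Pi. INT8 TFLite export generally needs a representative dataset for calibration, so check the accuracy of the exported model on your own data before deploying it.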
The jump from YOLOv8n to YOLOv8s is a meaningful accuracy gain and a meaningful performance cost. Establish your minimum acceptable accuracy on your specific dataset and test the smallest model that hits it. Many teams default to larger variants unnecessarily; YOLOv8n often meets production requirements and runs substantially faster. The same logic applies to Whisper: tiny before base, base before small.
Camera capture, resizing, and pre-processing account for a significant portion of total pipeline latency; inference is only part of it. Use V4L2 or libcamera directly rather than OpenCV's default capture backend where possible. GStreamer pipelines with hardware-accelerated decode can improve throughput where the codec is supported. Profile the full pipeline from frame grab to result, not just inference time in isolation.
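One lightweight way to profile the whole pipeline is to time each stage separately and compare the means. The sketch below uses only the standard library; `StageTimer` is a hypothetical helper, and the `time.sleep` calls are stand-ins for real capture, pre-processing, and inference code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations (in milliseconds) per named stage."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples[name].append(elapsed_ms)

    def mean_ms(self):
        return {name: sum(v) / len(v) for name, v in self.samples.items()}


timer = StageTimer()
for _ in range(5):
    with timer.stage("capture"):
        time.sleep(0.002)    # stand-in for grabbing a frame
    with timer.stage("preprocess"):
        time.sleep(0.001)    # stand-in for resize / colour conversion
    with timer.stage("inference"):
        time.sleep(0.005)    # stand-in for the model call

report = timer.mean_ms()
for name, ms in report.items():
    print(f"{name:<12}{ms:8.2f} ms")
```

If capture plus pre-processing rivals inference in this report, optimising the model further will not help much; the input path is the bottleneck.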
Running frame capture and inference in the same thread means each blocks the other. Put capture on a producer thread feeding a bounded queue; inference runs on a consumer thread pulling from it. This decouples the two pipelines, prevents frame drops during slower inference frames, and makes the overall system more responsive under variable load.
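A minimal sketch of that producer/consumer split, using only `threading` and `queue` from the standard library. The `read_frame` and `run_inference` callables are hypothetical stand-ins for your camera and model code. When the queue is full this version discards the oldest frame so inference always sees recent data; that policy is an assumption, and blocking the producer instead is equally valid if every frame matters.

```python
import queue
import threading


def capture_loop(read_frame, frames, stop):
    """Producer: push frames into a bounded queue, dropping the oldest when full."""
    while not stop.is_set():
        frame = read_frame()
        if frame is None:              # source exhausted
            break
        try:
            frames.put_nowait(frame)
        except queue.Full:
            try:
                frames.get_nowait()    # discard the stalest frame
            except queue.Empty:
                pass                   # consumer drained the queue in the meantime
            frames.put_nowait(frame)   # safe: this thread is the only producer
    frames.put(None)                   # sentinel tells the consumer to finish


def inference_loop(run_inference, frames, results):
    """Consumer: pull frames until the sentinel arrives."""
    while True:
        frame = frames.get()
        if frame is None:
            break
        results.append(run_inference(frame))


# Demo with integers standing in for frames and doubling standing in for inference.
source = iter(range(20))
frames = queue.Queue(maxsize=4)
stop = threading.Event()
results = []

producer = threading.Thread(
    target=capture_loop, args=(lambda: next(source, None), frames, stop)
)
consumer = threading.Thread(
    target=inference_loop, args=(lambda f: f * 2, frames, results)
)
consumer.start()
producer.start()
producer.join()
consumer.join()
print(f"processed {len(results)} of 20 frames")
```

In a long-running service, call `stop.set()` from a signal handler so both threads shut down cleanly.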
Raspberry Pi 5 will throttle under sustained CPU-intensive workloads without active cooling. Running continuous inference, the norm in CV deployments, is exactly the kind of sustained load that triggers thermal throttling and unpredictable latency. Use the official Pi 5 active cooler or an equivalent heatsink-with-fan solution. Active cooling is a deployment requirement, not a nice-to-have.
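Throttling is easy to detect from software during a soak test. The sketch below reads the SoC temperature from sysfs and decodes the bit field reported by `vcgencmd get_throttled`. It assumes Raspberry Pi OS, where both are normally available; the helper names are our own.

```python
import subprocess
from pathlib import Path

# Bit positions in the value reported by `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Decode a line such as 'throttled=0x50005' into human-readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [label for bit, label in THROTTLE_FLAGS.items() if value & (1 << bit)]


def read_cpu_temp_c(path="/sys/class/thermal/thermal_zone0/temp"):
    """Return the SoC temperature in Celsius (the file holds millidegrees)."""
    return int(Path(path).read_text().strip()) / 1000.0


def read_throttled():
    """Run vcgencmd on the Pi and return the active throttle flags."""
    completed = subprocess.run(
        ["vcgencmd", "get_throttled"], capture_output=True, text=True, check=True
    )
    return decode_throttled(completed.stdout)
```

Logging `read_cpu_temp_c()` and `read_throttled()` once a minute during a multi-hour inference run shows whether the enclosure and cooler hold up under sustained load.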
Loading a model from disk allocates weights, initialises buffers, and compiles the computation graph; this takes seconds. In a production service, load once at startup and keep it resident in memory. Reloading per request is a common mistake that makes performance appear 10 to 100× worse than the model actually is. Use a persistent service (systemd unit) rather than a script that exits between runs.
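Inside a long-running process, caching the loader is often enough to guarantee a single load. The sketch below uses `functools.lru_cache`; `load_model_from_disk` is a hypothetical stand-in for a real loader, and the dictionary it returns is a placeholder for a model object.

```python
import functools
import time

LOAD_COUNT = 0


def load_model_from_disk(path):
    """Hypothetical stand-in for an expensive model load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.05)                 # simulate slow weight loading
    return {"path": path}            # placeholder for a real model object


@functools.lru_cache(maxsize=None)
def get_model(path):
    """Load each model path once and keep it resident for the process lifetime."""
    return load_model_from_disk(path)


def handle_request(frame, path="model.ncnn"):
    model = get_model(path)          # cache hit after the first call
    return (model["path"], frame)    # stand-in for running inference


for frame in range(100):
    handle_request(frame)
print(f"requests: 100, model loads: {LOAD_COUNT}")
```

Calling `get_model(...)` once during service startup also moves the load cost out of the first request.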
If profiling shows the Pi CPU cannot meet your target frame rate or latency, the first-party Pi acceleration options for CV workloads are the Hailo-8L (13 TOPS, via the AI Kit or AI HAT+) and the Hailo-8 (26 TOPS, via the AI HAT+). The Coral USB Accelerator is an alternative for TFLite-compatible models. Measure first: many applications find Pi 5 CPU alone is sufficient, and hardware acceleration adds cost and integration complexity that is only justified when profiling proves it is needed.
Measuring Pi inference performance with a raw FP32 PyTorch model gives you a worst-case number, not a production-representative one. Convert and quantise first, then benchmark. Teams that skip this step incorrectly conclude Pi is too slow.
Pi is not a $3,000 GPU workstation. Define the minimum acceptable frame rate for your application, not the maximum possible, and test against that. Many production use cases need 2 to 10 FPS, not 60.
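A target frame rate translates directly into a per-frame time budget, which makes the pass/fail check simple arithmetic. In the sketch below the stage latencies are placeholder values chosen for illustration, not measurements, and the check assumes stages run sequentially in a single thread.

```python
def frame_budget_ms(target_fps):
    """Time available per frame, in milliseconds, at the target frame rate."""
    return 1000.0 / target_fps


def meets_target(stage_ms, target_fps):
    """True if the summed per-frame stage latencies fit inside the frame budget."""
    return sum(stage_ms.values()) <= frame_budget_ms(target_fps)


# Placeholder numbers for illustration only; substitute your own profiling results.
stage_ms = {"capture": 12.0, "preprocess": 8.0, "inference": 95.0, "postprocess": 5.0}

for fps in (2, 5, 10):
    verdict = "ok" if meets_target(stage_ms, fps) else "over budget"
    print(f"{fps:>2} FPS -> budget {frame_budget_ms(fps):6.1f} ms: {verdict}")
```

With these placeholder values the pipeline needs 120 ms per frame, so it fits a 5 FPS target (200 ms budget) but not a 10 FPS one (100 ms budget).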
A Pi 5 running sustained inference without active cooling will throttle and produce inconsistent latency. Thermals are a deployment engineering requirement, not an afterthought. Test under sustained load, not just for five seconds.
TFLite, NCNN, OpenCV, and dlib bindings change across versions and can break each other. Pin your virtualenv or use a locked container image for production deployments. Uncontrolled package upgrades on a deployed Pi will break inference quietly.
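Alongside a locked requirements file or container image, a startup check can refuse to run when an installed package has drifted from its pin. The sketch below uses `importlib.metadata` from the standard library; the package names and version strings in the usage note are hypothetical examples, not recommendations.

```python
from importlib import metadata


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that deviates.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


def assert_pins(pins):
    """Fail fast at service startup if the environment has drifted."""
    mismatches = check_pins(pins)
    if mismatches:
        raise RuntimeError(f"dependency drift detected: {mismatches}")
```

A service would call something like `assert_pins({"opencv-python": "4.9.0.80", "dlib": "19.24.2"})` before loading any model (versions hypothetical), turning a quiet breakage into a loud one.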
Edge AI systems require ongoing monitoring: inference latency, model accuracy over time as environmental conditions change, and hardware health (temperature, uptime). A model that was accurate at launch can degrade. Build structured logging and alerting in from day one.
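As a starting point for that logging, the sketch below emits one JSON line per inference and raises the log level when the rolling 95th-percentile latency crosses a threshold. `LatencyMonitor`, its window size, and the 250 ms threshold are all hypothetical choices for illustration.

```python
import json
import logging
from collections import deque


class LatencyMonitor:
    """Rolling latency window with structured (JSON) log output."""

    def __init__(self, window=100, p95_alert_ms=250.0, logger=None):
        self.samples = deque(maxlen=window)
        self.p95_alert_ms = p95_alert_ms
        self.logger = logger or logging.getLogger("inference")

    def p95(self):
        """Nearest-rank 95th percentile of the current window."""
        ordered = sorted(self.samples)
        if not ordered:
            return 0.0
        rank = (95 * len(ordered) + 99) // 100    # ceil(0.95 * n) in integers
        return ordered[rank - 1]

    def record(self, latency_ms, temperature_c=None):
        """Log one inference; return True if the p95 threshold is exceeded."""
        self.samples.append(latency_ms)
        p95 = self.p95()
        alert = p95 > self.p95_alert_ms
        event = {
            "event": "inference",
            "latency_ms": round(latency_ms, 2),
            "p95_ms": round(p95, 2),
            "temperature_c": temperature_c,
            "alert": alert,
        }
        self.logger.log(
            logging.WARNING if alert else logging.INFO, json.dumps(event)
        )
        return alert


logging.basicConfig(level=logging.WARNING)
monitor = LatencyMonitor()
monitor.record(42.0, temperature_c=55.1)   # below threshold: logged at INFO
```

JSON lines are easy to ship to whatever log collector the cloud backend already uses, and the same pattern extends to accuracy spot checks or uptime counters.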
We don't just write Python inference scripts on Pi. Our edge AI engagements cover the full stack: model selection and optimisation, runtime format conversion, the camera capture and inference pipeline, thermal design, hardware enclosure, and the cloud backend the Pi reports to. FZLockers and the smart vending conversion are not demos; they are production systems running at client sites.
If you are evaluating whether Pi is the right platform for your AI product, or have already committed to Pi and need engineering support to get it to production, the hire Pi developers with edge AI experience page details our engagements and how we work. For the broader scope of what we build across embedded AI platforms, see our embedded AI development services.
Model optimisation through production deployment: Pi AI, firmware, cloud backend, and hardware design. UK, US, and India.
DigitalMonk has shipped Pi-based AI systems in production: computer vision for access control, on-device object detection for smart vending, and voice AI for client products. Tell us what you are building.