On-Device AI: Building Smarter, Faster, And Private Applications

Shouldn’t there be a way to keep your app or project data private and improve performance by reducing server latency? This is what on-device AI is designed to solve. It handles AI processing locally, right on your device, without connecting to the internet or sending data to the cloud. In this article, Joas Pambou explains what on-device AI is, why it’s important, the tools to build this type of technology, and how it can change the way we use technology every day.

It’s not too far-fetched to say AI is a pretty handy tool that we all rely on for everyday tasks. It handles things like recognizing faces, understanding or cloning speech, analyzing large datasets, and creating personalized app experiences, such as music playlists based on your listening habits or workout plans matched to your progress.

But here’s the catch:

Where an AI tool actually lives and does its work matters a lot.

Take self-driving cars, for example. These types of cars need AI to process data from cameras, sensors, and other inputs to make split-second decisions, such as detecting obstacles or adjusting speed for sharp turns. Now, if all that processing depends on the cloud, network latency or connection issues could lead to delayed responses or system failures. That’s why the AI should operate directly within the car. This ensures the car responds instantly without needing direct access to the internet.

This is what we call On-Device AI (ODAI). Simply put, ODAI means AI does its job right where you are — on your phone, your car, or your wearable device, and so on — without a real need to connect to the cloud or internet in some cases. More precisely, this kind of setup is categorized as Embedded AI (EMAI), where the intelligence is embedded into the device itself.

Okay, I mentioned ODAI and then EMAI as a subset that falls under the umbrella of ODAI. However, EMAI is slightly different from other terms you might come across, such as Edge AI, Web AI, and Cloud AI. So, what’s the difference? Here’s a quick breakdown:

  • Edge AI
    It refers to running AI models directly on devices instead of relying on remote servers or the cloud. A simple example of this is a security camera that can analyze footage right where it is. It processes everything locally and is close to where the data is collected.
  • Embedded AI
    In this case, AI algorithms are built inside the device or hardware itself, so it functions as if the device has its own mini AI brain. I mentioned self-driving cars earlier — another example is AI-powered drones, which can monitor areas or map terrains. One of the main differences between the two is that EMAI uses dedicated chips integrated with AI models and algorithms to perform intelligent tasks locally.
  • Cloud AI
    This is when the AI lives on and relies on the cloud or remote servers. When you use a language translation app, it sends the text you want translated to a cloud-based server, where the AI processes it and sends the translation back. The entire operation happens in the cloud, so it requires an internet connection to work.
  • Web AI
    These are tools or apps that run in your browser or are part of websites or online platforms. You might see product suggestions that match your preferences based on what you’ve looked at or purchased before. However, these tools often rely on AI models hosted in the cloud to analyze data and generate recommendations.

The main difference? It’s about where the AI does the work: on your device, nearby, or somewhere far off in the cloud or web.

What Makes On-Device AI Useful

On-device AI is, first and foremost, about privacy — keeping your data secure and under your control. It processes everything directly on your device, avoiding the need to send personal data to external servers (cloud). So, what exactly makes this technology worth using?

Real-Time Processing

On-device AI processes data instantly because it doesn’t need to send anything to the cloud. For example, think of a smart doorbell — it recognizes a visitor’s face right away and notifies you. If it had to wait for cloud servers to analyze the image, there’d be a delay, which wouldn’t be practical for quick notifications.

Enhanced Privacy and Security

Picture this: You are opening an app using voice commands or calling a friend and receiving a summary of the conversation afterward. Your phone processes the audio data locally, and the AI system handles everything directly on your device without the help of external servers. This way, your data stays private, secure, and under your control.

Offline Functionality

A big win of ODAI is that it doesn’t need the internet to work, which means it can function even in areas with poor or no connectivity. You can take modern GPS navigation systems in a car as an example; they give you turn-by-turn directions with no signal, making sure you still get where you need to go.

Reduced Latency

ODAI skips the round trip of sending data to the cloud and waiting for a response. This means that when you make a change, like adjusting a setting, the device processes the input immediately, making your experience smoother and more responsive.

The Technical Pieces Of The On-Device AI Puzzle

At its core, ODAI uses special hardware and efficient model designs to carry out tasks directly on devices like smartphones, smartwatches, and Internet of Things (IoT) gadgets. Thanks to advances in hardware technology, AI can now work locally, powered by AI-specific processors such as the following:

  • Neural Processing Units (NPUs)
    These chips are specifically designed for AI and optimized for neural nets, deep learning, and machine learning applications. They can handle demanding AI workloads efficiently while consuming minimal power.
  • Graphics Processing Units (GPUs)
    Known for processing multiple tasks simultaneously, GPUs excel in speeding up AI operations, particularly with massive datasets.

Here’s a look at some innovative AI chips in the industry:

Product | Organization | Key Features
Spiking Neural Network Chip | Indian Institute of Technology | Ultra-low power consumption
Hierarchical Learning Processor | Ceromorphic | Alternative transistor structure
Intelligent Processing Units (IPUs) | Graphcore | Multiple products targeting end devices and cloud
Katana Edge AI | Synaptics | Combines vision, motion, and sound detection
ET-SoC-1 Chip | Esperanto Technology | Built on RISC-V for AI and non-AI workloads
NeuRRAM | CEA–Leti | Biologically inspired neuromorphic processor based on resistive RAM (RRAM)

These chips or AI accelerators show different ways to make devices more efficient, use less power, and run advanced AI tasks.

Techniques For Optimizing AI Models

Creating AI models that fit resource-constrained devices often requires combining clever hardware utilization with techniques to make models smaller and more efficient. I’d like to cover a few choice examples of how teams are optimizing AI for increased performance using less energy.

Meta’s MobileLLM

Meta’s approach to ODAI introduced a model built specifically for smartphones. Instead of scaling traditional models, they designed MobileLLM from scratch to balance efficiency and performance. One key innovation was increasing the number of smaller layers rather than having fewer large ones. This design choice improved the model’s accuracy and speed while keeping it lightweight. You can try out the model either on Hugging Face or using vLLM, a library for LLM inference and serving.
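
If you want to experiment, a minimal sketch with the Hugging Face transformers library looks something like the following. Note that the repository name facebook/MobileLLM-125M and the generation settings are assumptions for illustration; check the model card on Hugging Face for the exact identifiers and usage requirements.

```python
# Minimal sketch: loading a MobileLLM checkpoint with Hugging Face transformers.
# The model ID below is an assumption; verify it against the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/MobileLLM-125M"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("On-device AI matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```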

Quantization

This simplifies a model’s internal calculations by using lower-precision numbers, such as 8-bit integers, instead of 32-bit floating-point numbers. Quantization significantly reduces memory requirements and computation costs, often with minimal impact on model accuracy.
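
As a concrete illustration, here is a minimal sketch of post-training dynamic quantization with PyTorch’s built-in quantize_dynamic utility, which stores Linear-layer weights as 8-bit integers. The toy model and shapes are made up for the example.

```python
# Minimal sketch: post-training dynamic quantization in PyTorch.
# Linear-layer weights are stored as 8-bit integers instead of 32-bit floats.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

quantized = torch.quantization.quantize_dynamic(
    model,              # the float32 model
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # 8-bit integer weights
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller memory footprint
```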

Pruning

Neural networks contain many weights (connections between neurons), but not all are crucial. Pruning identifies and removes less important weights, resulting in a smaller, faster model without significant accuracy loss.
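
Here’s a minimal sketch using PyTorch’s torch.nn.utils.prune module to zero out the 30% of weights with the smallest magnitudes; the layer size and pruning amount are arbitrary choices for illustration.

```python
# Minimal sketch: magnitude-based weight pruning with torch.nn.utils.prune.
# The 30% of weights with the smallest absolute values are zeroed out.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 256)
prune.l1_unstructured(layer, name="weight", amount=0.3)  # zero 30% of weights
prune.remove(layer, "weight")  # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.0%}")  # roughly 30%
```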

Matrix Decomposition

Large matrices are a core component of AI models. Matrix decomposition splits these into smaller matrices, reducing computational complexity while approximating the original model’s behavior.
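
The sketch below illustrates the idea with a truncated SVD in PyTorch: one 512×512 weight matrix (about 262k parameters) is approximated by two factors totaling about 65k. The sizes and rank are arbitrary illustration values.

```python
# Minimal sketch: low-rank factorization of a Linear layer via truncated SVD.
# A single large multiply is replaced by two smaller ones that approximate it.
import torch
import torch.nn as nn

layer = nn.Linear(512, 512, bias=False)
rank = 64

U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
A = U[:, :rank] * S[:rank]  # (512, 64)
B = Vh[:rank, :]            # (64, 512)

low_rank = nn.Sequential(nn.Linear(512, rank, bias=False),
                         nn.Linear(rank, 512, bias=False))
low_rank[0].weight.data = B  # first multiply by B ...
low_rank[1].weight.data = A  # ... then by A, approximating W = A @ B

x = torch.randn(1, 512)
print((layer(x) - low_rank(x)).abs().max())  # error from the rank truncation
```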

Knowledge Distillation

This technique involves training a smaller model (the “student”) to mimic the outputs of a larger, pre-trained model (the “teacher”). The smaller model learns to replicate the teacher’s behavior, achieving similar accuracy while being more efficient. For instance, DistilBERT successfully reduced BERT’s size by 40% while retaining 97% of its performance.
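
A common way to express this in code is a combined loss: a KL-divergence term between the temperature-softened teacher and student distributions, plus ordinary cross-entropy against the true labels. The sketch below uses made-up logits and typical values for the temperature and mixing weight.

```python
# Minimal sketch: a knowledge-distillation loss in PyTorch. The student is
# trained to match the teacher's softened outputs plus the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened student/teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class problem:
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```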

Technologies Used For On-Device AI

Well, all the model compression techniques and specialized chips are cool because they’re what make ODAI possible. But what’s even more interesting for us as developers is actually putting these tools to work. This section covers some of the key technologies and frameworks that make ODAI accessible.

MediaPipe Solutions

MediaPipe Solutions is a developer toolkit for adding AI-powered features to apps and devices. It offers cross-platform, customizable tools that are optimized for running AI locally, from real-time video analysis to natural language processing.

At the heart of MediaPipe Solutions is MediaPipe Tasks, a core library that lets developers deploy ML solutions with minimal code. It’s designed for platforms like Android, Python, and Web/JavaScript, so you can easily integrate AI into a wide range of applications.

MediaPipe also provides various specialized tasks for different AI needs:

  • LLM Inference API
    This API runs lightweight large language models (LLMs) entirely on-device for tasks like text generation and summarization. It supports several open models like Gemma and external options like Phi-2.
  • Object Detection
    The tool helps you identify and locate objects in images or videos, which is ideal for real-time applications like detecting animals, people, or objects right on the device (see the sketch after this list).
  • Image Segmentation
    MediaPipe can also segment images, such as isolating a person from the background in a video feed, allowing it to separate objects in both single images (like photos) and continuous video streams (like live video or recorded footage).
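
As an example of the kind of code involved, here is a minimal object-detection sketch with the MediaPipe Tasks Python API. The model file name (efficientdet_lite0.tflite) and input image are assumptions; in practice, you’d first download a detection model from the MediaPipe documentation.

```python
# Minimal sketch: on-device object detection with MediaPipe Tasks in Python.
# Assumes a downloaded TFLite detection model such as EfficientDet-Lite0
# (the file names below are assumptions).
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path="efficientdet_lite0.tflite"),
    score_threshold=0.5,  # ignore low-confidence detections
)
detector = vision.ObjectDetector.create_from_options(options)

image = mp.Image.create_from_file("photo.jpg")  # hypothetical input file
result = detector.detect(image)
for detection in result.detections:
    category = detection.categories[0]
    print(category.category_name, f"{category.score:.2f}")
```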

LiteRT

LiteRT, or Lite Runtime (previously called TensorFlow Lite), is a lightweight, high-performance runtime designed for ODAI. It supports running pre-trained models or converting TensorFlow, PyTorch, and JAX models to a LiteRT-compatible format using AI Edge tools.
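
For a feel of the workflow, here is a minimal sketch using the classic TensorFlow Lite API that LiteRT grew out of: convert a small Keras model and run it with the interpreter. The toy model is made up for the example, and newer AI Edge tooling may expose slightly different entry points.

```python
# Minimal sketch: converting a Keras model to the LiteRT/TFLite format and
# running it with the bundled interpreter (the classic TensorFlow Lite API).
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(10)])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable quantization
tflite_model = converter.convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

interpreter.set_tensor(inp["index"], np.random.rand(1, 4).astype(np.float32))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]))
```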

Model Explorer

Model Explorer is a visualization tool that helps you analyze machine learning models and graphs. It simplifies the process of preparing these models for on-device AI deployment, letting you understand the structure of your models and fine-tune them for better performance.

[Image: Model Explorer visualizes ML models and graphs to prepare and optimize them for on-device AI. (Image source: Google)]

You can use Model Explorer locally or in Colab for testing and experimenting.
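
As a rough sketch of what local usage can look like (the ai-edge-model-explorer package name and the visualize call are assumptions based on the project’s documentation, so double-check the README):

```python
# Minimal sketch, assuming the ai-edge-model-explorer package is installed
# (pip install ai-edge-model-explorer). Package name and API are assumptions;
# verify them against the Model Explorer README.
import model_explorer

# Starts a local server and renders the model graph in the browser.
model_explorer.visualize("model.tflite")  # hypothetical model file
```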

ExecuTorch

If you’re familiar with PyTorch, ExecuTorch makes it easy to deploy models to mobile, wearables, and edge devices. It’s part of the PyTorch Edge ecosystem, which supports building AI experiences for edge devices like embedded systems and microcontrollers.
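
The export flow is roughly: capture the model with torch.export, lower it to the Edge dialect, and serialize it to a .pte file that the on-device runtime loads. Here is a minimal sketch of that flow; API details vary between ExecuTorch versions, so treat it as an outline rather than a definitive recipe.

```python
# Minimal sketch: exporting a PyTorch module to ExecuTorch's .pte format,
# which the on-device runtime can then load. Based on the documented
# torch.export -> to_edge -> to_executorch flow; verify against the docs.
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) * 2

example_inputs = (torch.randn(1, 8),)
exported = torch.export.export(TinyModel(), example_inputs)

et_program = to_edge(exported).to_executorch()
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)  # deploy this file to the device runtime
```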

Large Language Models For On-Device AI

Gemini is a powerful AI model that doesn’t just excel in processing text or images. It can also handle multiple types of data seamlessly. The best part? It’s designed to work right on your devices.

For on-device use, there’s Gemini Nano, a lightweight version of the model. It’s built to perform efficiently while keeping everything private.

What can Gemini Nano do?

  • Call Notes on Pixel devices
    This feature creates private summaries and transcripts of conversations. It works entirely on-device, ensuring privacy for everyone involved.
  • Pixel Recorder app
    With the help of Gemini Nano and AICore, the app provides an on-device summarization feature, making it easy to extract key points from recordings.
  • TalkBack
    Gemini Nano enhances this Android accessibility feature by providing clear descriptions of images, thanks to its multimodal capabilities.

Note: It’s similar to an application we built using LLaVA in a previous article.

Gemini Nano is far from the only language model designed specifically for ODAI. I’ve collected a few others that are worth mentioning:

Model | Developer | Research Paper
Octopus v2 | NexaAI | On-device language model for super agent
OpenELM | Apple ML Research | A significant large language model integrated within iOS to enhance application functionalities
Ferret-v2 | Apple | Ferret-v2 significantly improves upon its predecessor, introducing enhanced visual processing capabilities and an advanced training regimen
MiniCPM | Tsinghua University | A GPT-4V Level Multimodal LLM on Your Phone
Phi-3 | Microsoft | Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

The Trade-Offs of Using On-Device AI

Building AI into devices can be exciting and practical, but it’s not without its challenges. While you may get a lightweight, private solution for your app, there are a few compromises along the way. Here’s a look at some of them:

Limited Resources

Phones, wearables, and similar devices don’t have the same computing power as larger machines. This means AI models must fit within limited storage and memory while running efficiently. Additionally, running AI can drain the battery, so the models need to be optimized to balance power usage and performance.

Data and Updates

AI in devices like drones and self-driving cars processes data quickly, using sensors or lidar to make decisions. However, these models or the systems themselves don’t usually get real-time updates or additional training unless they’re connected to the cloud. Without these updates and regular retraining, the system may struggle with new situations.

Biases

Biases in training data are a common challenge in AI, and ODAI models are no exception. These biases can lead to unfair decisions or errors, like misidentifying people. For ODAI, keeping these models fair and reliable means not only addressing these biases during training but also ensuring the solutions work efficiently within the device’s constraints.

These aren’t the only challenges of on-device AI. It’s still a new and growing technology, and the small number of professionals in the field makes it harder to implement.

Conclusion

Choosing between on-device and cloud-based AI comes down to what your application needs most. Here’s a quick comparison to make things clear:

Aspect | On-Device AI | Cloud-Based AI
Privacy | Data stays on the device, ensuring privacy. | Data is sent to the cloud, raising potential privacy concerns.
Latency | Processes instantly with no delay. | Relies on internet speed, which can introduce delays.
Connectivity | Works offline, making it reliable in any setting. | Requires a stable internet connection.
Processing Power | Limited by device hardware. | Leverages the power of cloud servers for complex tasks.
Cost | No ongoing server expenses. | Can incur continuous cloud infrastructure costs.

For apps that need fast processing and strong privacy, ODAI is the way to go. On the other hand, cloud-based AI is better when you need more computing power and frequent updates. The choice depends on your project’s needs and what matters most to you.

Smashing Editorial (gg, yk)