Quantization in LLMs: How to Run AI on Your iPhone Without Burning It
In Part 1 of this series, we set up the MLX ecosystem and ran a language model locally on Apple Silicon. If you haven’t read it yet, it’s worth starting there. This article tackles the question that naturally follows: how do you fit a multi-billion parameter model into a device with 8 GB of RAM? The answer is quantization — and understanding it will change how you think about on-device AI. ...
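To make the constraint concrete, here is a quick back-of-the-envelope sketch of how much memory a model's weights occupy at different precisions. The 7-billion-parameter figure is an illustrative example, not a model discussed in the article:

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory footprint of the weights alone,
    ignoring activations, KV cache, and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

# A hypothetical 7B-parameter model:
fp16_gb = model_memory_gb(7e9, 16)  # 16-bit weights: 14.0 GB -- exceeds 8 GB of RAM
q4_gb = model_memory_gb(7e9, 4)     # 4-bit weights:   3.5 GB -- fits with room to spare
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

The arithmetic is the whole story: halving or quartering the bits per weight shrinks the footprint proportionally, which is why quantization is the standard lever for fitting large models on memory-constrained devices.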