In Part 1 of this series, we built a minimal LLM inference pipeline on Apple Silicon using MLX Swift. In Part 2, we quantized a model from scratch and saw how 4-bit precision makes billion-parameter models tractable on a phone.

This article takes a different angle. Instead of generating text, we are going to encode meaning.

A user types: “best hikes near a volcano.” A keyword search returns nothing useful. An embedding-based system returns exactly what they need, because it matches on what the words mean, not just how they are spelled. That capability is what embeddings unlock, and it underpins many of the AI features in production today: semantic search, recommendation systems, RAG pipelines, and clustering.

The tool for this on Apple Silicon is MLXEmbedders, a Swift library that runs embedding models natively on-device via the MLX framework. This article explains the full theory, the complete pipeline, and the architecture of MLXEmbedders itself — so you can use it correctly in production.

1. Embeddings vs. Large Language Models

Before touching any code, it is worth being precise about what an embedding model is — and what it is not.

A large language model performs a sequence-to-sequence transformation: it takes a sequence of tokens as input and produces another sequence of tokens as output. Its job is generation.

An embedding model performs a sequence-to-vector transformation: it takes a sequence of tokens as input and produces a single dense numerical vector as output. Its job is representation.

LLM:        [tokens] → [tokens]   (generative)
Embedder:   [tokens] → [vector]   (representational)

This distinction explains why MLXEmbedders is a separate library from MLX Swift LM. The two share underlying infrastructure — tokenization, weight loading, transformer layers — but their output contracts and pooling logic are fundamentally different. Using an LLM where you need an embedder, or vice versa, is not just a style error; it produces meaningless results.

The separation also has a practical implication: embedding models are typically far smaller than LLMs. A capable model like nomic-embed-text has roughly 137M parameters and a 768-dimensional output space. At 16-bit precision its weights fit in under 300 MB (137M × 2 bytes ≈ 274 MB), and it can reach thousands of embeddings per second on Apple Silicon, depending on the chip and the input length. The compute profile is completely different from a 7B autoregressive model: one forward pass per input, with no token-by-token decoding loop.

2. The Mental Model: Position in Meaning Space

The most useful way to think about an embedding is not as data — it is as a position.

An embedding model has been trained so that semantically similar inputs end up geometrically close in the output vector space. Two sentences that express the same idea will produce vectors that point in nearly the same direction, regardless of the words used. Two sentences on unrelated topics will produce vectors that are far apart, or even orthogonal.

This means that once you have a set of embedded items — documents, product descriptions, user queries, code snippets — the entire problem of finding similar things reduces to finding nearby points. That is a geometry problem, and geometry scales.
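
To make the "nearby points" idea concrete, here is a minimal sketch using hand-written 3-dimensional vectors. The labels, coordinates, and helper names (`ToyPoint`, `distance`, `nearest`) are invented for illustration; real embeddings come from a model and have hundreds of dimensions.

```swift
/// A labelled point in a toy 3-dimensional "meaning space".
/// Labels and coordinates are invented for illustration only.
struct ToyPoint {
    let label: String
    let vector: [Float]
}

let points = [
    ToyPoint(label: "volcano hike", vector: [0.9, 0.1, 0.0]),
    ToyPoint(label: "crater trail", vector: [0.8, 0.2, 0.1]),
    ToyPoint(label: "tax return", vector: [0.0, 0.1, 0.9]),
]

/// Euclidean distance between two vectors of equal length.
func distance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Vectors must have the same dimension")
    let squaredDifferences = zip(a, b).map { ($0 - $1) * ($0 - $1) }
    return squaredDifferences.reduce(0, +).squareRoot()
}

/// Returns the stored point closest to `query`, or nil if `points` is empty.
func nearest(to query: [Float], in points: [ToyPoint]) -> ToyPoint? {
    points.min { distance($0.vector, query) < distance($1.vector, query) }
}

let match = nearest(to: [0.88, 0.1, 0.02], in: points)
// match?.label is "volcano hike"
```

The search itself knows nothing about language. All of the semantics live in where the model placed the points.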

Everything else that embeddings enable — semantic search, clustering, classification, RAG — is built on top of this one property.

3. The Full Embedding Pipeline

When you call an embedding model, several distinct transformations happen in sequence. Understanding each step is necessary for using the system correctly, particularly when it comes to pooling.

3.1 Tokenization

The input text is first decomposed into tokens: subword units drawn from the model’s vocabulary. For most modern embedding models, this uses WordPiece or byte-pair encoding (BPE), the mechanisms used by BERT-family and GPT-family models respectively.

One important consequence: a word like “unmistakable” may be split into ["un", "##mis", "##tak", "##able"] (the exact pieces depend on the vocabulary). The model never sees whole words; it sees token IDs, which are integer indices into a learned vocabulary table.
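
The split above can be reproduced with a greedy longest-match loop in the style of WordPiece. This is a simplified sketch, not the tokenizer that ships with any particular model: the function name and the tiny vocabulary are assumptions made for the example.

```swift
/// Greedy longest-match-first subword split in the style of WordPiece.
/// Continuation pieces carry a "##" prefix. The vocabulary passed in below
/// is a toy set; real vocabularies hold tens of thousands of entries.
func wordPieceTokenize(_ word: String, vocabulary: Set<String>) -> [String] {
    let characters = Array(word)
    var tokens: [String] = []
    var start = 0

    while start < characters.count {
        var end = characters.count
        var match: String? = nil

        // Try the longest remaining substring first, then shrink it.
        while end > start {
            let piece = String(characters[start..<end])
            let candidate = start == 0 ? piece : "##" + piece
            if vocabulary.contains(candidate) {
                match = candidate
                break
            }
            end -= 1
        }

        // No piece matched: the whole word maps to the unknown token.
        guard let token = match else { return ["[UNK]"] }
        tokens.append(token)
        start = end
    }
    return tokens
}

let toyVocabulary: Set<String> = ["un", "##mis", "##tak", "##able"]
let pieces = wordPieceTokenize("unmistakable", vocabulary: toyVocabulary)
// pieces == ["un", "##mis", "##tak", "##able"]
```

A production tokenizer then maps each piece to its integer ID and adds special tokens such as [CLS] and [SEP].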

3.2 Token Embeddings

Each token ID is mapped to a learned vector via an embedding lookup table — a weight matrix of shape [vocab_size × hidden_dim]. This is where the first semantic information enters the pipeline: each token has a position in the model’s internal representation space.

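Positional information is injected either at this stage or later, depending on the architecture. Absolute positional embeddings are added to the token vectors here; rotary embeddings are instead applied to the queries and keys inside each attention layer.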
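
The lookup itself is just row indexing. The sketch below uses an invented 4 × 4 table to show the shapes involved; `embeddingTable` and `embed` are illustrative names, and the values stand in for learned weights.

```swift
/// A toy embedding table: one row of `hiddenDim` floats per vocabulary entry.
/// The values are invented; in a real model they are learned weights.
let embeddingTable: [[Float]] = [
    [0.0, 0.0, 0.0, 0.0],   // id 0
    [0.1, 0.3, -0.2, 0.5],  // id 1
    [0.7, -0.1, 0.4, 0.0],  // id 2
    [-0.3, 0.2, 0.9, 0.1],  // id 3
]

/// Maps token IDs to their rows, producing a [sequenceLength × hiddenDim] matrix.
func embed(tokenIDs: [Int], table: [[Float]]) -> [[Float]] {
    tokenIDs.map { id -> [Float] in
        precondition(table.indices.contains(id), "Token ID outside the vocabulary")
        return table[id]
    }
}

let tokenVectors = embed(tokenIDs: [2, 1, 3], table: embeddingTable)
// tokenVectors has 3 rows of 4 values each
```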

3.3 Transformer Encoding

The embedded tokens pass through a stack of transformer blocks. Each block applies multi-head self-attention followed by a feed-forward projection, with layer normalization. The attention mechanism allows every token to attend to every other token in the sequence, building up a contextually enriched representation.

After the final transformer layer, you have a hidden state matrix of shape [sequence_length × hidden_dim] — one vector per token, each informed by the full context of the sequence.

This is the step that makes embedding models qualitatively different from simple word2vec-style lookup tables. The representation of “bank” in “river bank” is a different vector from “bank” in “bank account”, because the transformer has incorporated the surrounding context.
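
A stripped-down illustration of that effect: the sketch below runs single-head self-attention with identity projections over invented vectors. It is not how a real transformer block is implemented (learned projections, multiple heads, residual connections, and feed-forward layers are all omitted), but it shows the same input vector for “bank” coming out differently depending on its neighbour.

```swift
import Foundation  // for exp

/// Numerically stable softmax over a list of scores.
func softmax(_ scores: [Float]) -> [Float] {
    let maxScore = scores.max() ?? 0
    let exps = scores.map { exp($0 - maxScore) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

func dot(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map(*).reduce(0, +)
}

/// Single-head self-attention with identity projections: queries, keys, and
/// values are the input vectors themselves.
func selfAttention(_ tokens: [[Float]]) -> [[Float]] {
    guard let dim = tokens.first?.count else { return [] }
    let scale = Float(dim).squareRoot()
    return tokens.map { query -> [Float] in
        let weights = softmax(tokens.map { dot(query, $0) / scale })
        // Weighted sum of all token vectors, including the query's own.
        var output = [Float](repeating: 0, count: dim)
        for (weight, value) in zip(weights, tokens) {
            for i in 0..<dim { output[i] += weight * value[i] }
        }
        return output
    }
}

// Invented vectors: the same "bank" input in two different contexts.
let bank: [Float] = [1, 0, 1, 0]
let river: [Float] = [0, 2, 0, 0]
let account: [Float] = [0, 0, 0, 2]
let bankNearRiver = selfAttention([river, bank])[1]
let bankNearAccount = selfAttention([bank, account])[0]
// The two outputs differ because each mixes in a different neighbour.
```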

3.4 Pooling

The transformer output is a matrix — one vector per token. But we need a single vector for the whole sequence. Pooling is the operation that collapses this matrix into a single vector.

This is the most consequential step in the pipeline, and the one most commonly misapplied.

There are two principal strategies:

CLS token pooling. In BERT-style architectures, a special [CLS] token is prepended to the input. The model is trained so that the output vector at position 0 (the [CLS] position) represents the meaning of the entire sequence. Some BERT-derived embedding models use this strategy. Sentence-BERT (Reimers & Gurevych, 2019) evaluated it alongside mean and max pooling and adopted mean pooling as its default.

Mean pooling. The hidden states of all non-padding tokens are averaged. This is often more robust than CLS pooling because it aggregates information from the entire sequence rather than relying on a single learned summarizer. Models like nomic-embed-text and all-MiniLM-L6-v2 use mean pooling.

The critical rule: use the pooling strategy the model was trained with. This is documented on the model’s Hugging Face model card. Using mean pooling on a model trained with CLS pooling — or ignoring the attention mask in mean pooling — produces vectors that are numerically plausible but semantically meaningless. There is no error; the system will silently return bad embeddings.

⚠️ Pooling Mismatch Is Silent
Wrong pooling does not produce an error. It produces vectors that look valid — they have the right dimensionality and numerical range — but do not reflect the actual meaning of the input. Always verify the pooling strategy against the model card.
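
Both strategies are short enough to write out by hand. The sketch below operates on plain Swift arrays rather than MLX tensors, and the function names are illustrative rather than taken from the library. It also shows why the attention mask matters for mean pooling.

```swift
/// CLS pooling: take the hidden state at position 0.
func clsPool(_ hiddenStates: [[Float]]) -> [Float] {
    precondition(!hiddenStates.isEmpty, "Need at least one token")
    return hiddenStates[0]
}

/// Masked mean pooling: average only positions where the mask is 1,
/// so padding tokens do not contribute to the result.
func meanPool(_ hiddenStates: [[Float]], attentionMask: [Int]) -> [Float] {
    precondition(hiddenStates.count == attentionMask.count, "One mask entry per token")
    guard let dim = hiddenStates.first?.count else { return [] }
    var sum = [Float](repeating: 0, count: dim)
    var realTokens: Float = 0
    for (state, mask) in zip(hiddenStates, attentionMask) where mask == 1 {
        for i in 0..<dim { sum[i] += state[i] }
        realTokens += 1
    }
    guard realTokens > 0 else { return sum }
    return sum.map { $0 / realTokens }
}

// Three real tokens followed by one padding position (invented values).
let hidden: [[Float]] = [[1, 1], [3, 5], [2, 0], [9, 9]]
let cls = clsPool(hidden)                                    // [1.0, 1.0]
let masked = meanPool(hidden, attentionMask: [1, 1, 1, 0])   // [2.0, 2.0]
let unmasked = meanPool(hidden, attentionMask: [1, 1, 1, 1]) // [3.75, 3.75]
```

The three results have the same shape and a plausible range, yet they are different vectors. Nothing at the type level distinguishes the correct one from the other two.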

3.5 Normalization

After pooling, the resulting vector is typically L2-normalized — scaled so that its Euclidean norm equals 1. This is a prerequisite for efficient cosine similarity computation: once all vectors are unit-length, cosine similarity simplifies to a dot product, which is both faster and numerically more stable.

Most production embedding models are trained with normalized output vectors. MLXEmbedders handles normalization automatically for supported models, but you should verify this when working with custom model configurations.
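
If you do need to normalize yourself, the operation is a single division by the vector's norm. A minimal sketch on plain arrays, with an illustrative function name:

```swift
/// Scales a vector to unit Euclidean length.
/// A zero vector has no direction and is returned unchanged.
func l2Normalized(_ v: [Float]) -> [Float] {
    let norm = v.map { $0 * $0 }.reduce(0, +).squareRoot()
    guard norm > 0 else { return v }
    return v.map { $0 / norm }
}

let unit = l2Normalized([3, 4])
// unit is approximately [0.6, 0.8], and its norm is 1
```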

4. Similarity: The Core Operation

Once you have embedding vectors, the primary operation you perform on them is measuring similarity.

Cosine similarity is the standard metric for comparing embedding vectors:

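import Foundation  // for sqrt

/// Cosine similarity between two vectors of equal length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Vectors must have the same dimension")
    let dot = zip(a, b).map(*).reduce(0, +)
    let normA = sqrt(a.map { $0 * $0 }.reduce(0, +))
    let normB = sqrt(b.map { $0 * $0 }.reduce(0, +))
    // A zero vector has no direction; treat it as unrelated to everything.
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA * normB)
}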

For L2-normalized vectors, the denominator is always 1 by construction, reducing this to a pure dot product — which is why normalization is done at the model level rather than leaving it to the caller.

The score range can be read roughly as follows. The exact thresholds vary by model and should be calibrated on your own data:

Score      Interpretation
1.0        Identical meaning
0.7–0.9    Highly similar
0.3–0.6    Related
~0.0       Unrelated
−1.0       Opposite (rare in practice)

Dot product and Euclidean distance are also used in some systems. For L2-normalized vectors all three produce the same ranking, so cosine similarity is a safe default unless you have a specific reason to deviate.
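
For unit-length vectors the three metrics are tied together by two short identities, written out here as a worked derivation (a and b denote any two L2-normalized embeddings):

```latex
\begin{aligned}
\cos(a, b) &= \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = a \cdot b
  && \text{when } \lVert a \rVert = \lVert b \rVert = 1, \\
\lVert a - b \rVert^2 &= \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b
  = 2 - 2\, a \cdot b .
\end{aligned}
```

Sorting by descending cosine similarity, descending dot product, or ascending Euclidean distance therefore yields the same order, provided every vector has been normalized.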

5. The MLXEmbedders Architecture

MLXEmbedders is part of the mlx-swift-lm package. Its design reflects a clean separation of three concerns: model resolution, weight loading, and concurrency-safe execution.

5.1 ModelTypeRegistry

The registry maps the model_type string from a model’s config.json to the corresponding Swift model implementation. When you load a nomic-embed-text model, the registry resolves "nomic_bert" to its architecture class. This is the extensibility point: adding support for a new architecture requires registering a new type, not modifying the loading infrastructure.
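
The registry pattern can be sketched in a few lines. Everything below is a hypothetical stand-in written for this article (`ToyEmbeddingModel`, `ToyModelTypeRegistry`, `RegistryError`), not the library's actual types; it only illustrates how a string key resolves to a model constructor.

```swift
/// Hypothetical stand-ins for illustration; these are not the library's types.
protocol ToyEmbeddingModel {
    var name: String { get }
}
struct ToyBert: ToyEmbeddingModel { let name = "bert" }
struct ToyNomicBert: ToyEmbeddingModel { let name = "nomic_bert" }

enum RegistryError: Error {
    case unsupportedModelType(String)
}

/// Maps a `model_type` string to a closure that builds the matching model.
struct ToyModelTypeRegistry {
    private var creators: [String: () -> ToyEmbeddingModel] = [:]

    mutating func register(_ modelType: String, creator: @escaping () -> ToyEmbeddingModel) {
        creators[modelType] = creator
    }

    func createModel(modelType: String) throws -> ToyEmbeddingModel {
        guard let creator = creators[modelType] else {
            throw RegistryError.unsupportedModelType(modelType)
        }
        return creator()
    }
}

var registry = ToyModelTypeRegistry()
registry.register("bert") { ToyBert() }
registry.register("nomic_bert") { ToyNomicBert() }

let resolved = try registry.createModel(modelType: "nomic_bert")
// resolved is a ToyNomicBert; an unregistered key throws instead
```

Supporting a new architecture in this sketch means one more `register` call. The lookup code does not change.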

5.2 GenericModelFactory

The factory handles the loading sequence for any registered model type:

  1. Resolve the model configuration from config.json
  2. Look up the model class in the registry
  3. Load weight tensors from .safetensors files into unified memory
  4. Initialize the tokenizer from the vocabulary and config files
  5. Configure the pooling strategy as specified in the model’s configuration
  6. Return a fully initialized model ready for inference

5.3 ModelContainer (Actor)

The ModelContainer wraps the loaded model in a Swift actor, enforcing serial access to the underlying MLX state. This is the correct concurrency design for on-device inference: MLX operations are not thread-safe by default, and the actor boundary ensures that only one inference call executes at a time, without requiring explicit locking in application code.

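// Simplified sketch of the idea, not the library's exact declaration.
// Check the MLXEmbedders source for the current public API.
public actor ModelContainer {
    let model: EmbeddingModel
    let tokenizer: Tokenizer

    public func encode(_ texts: [String]) async throws -> [[Float]] {
        // Tokenize, run the forward pass, pool, and normalize here.
        // Actor isolation ensures calls are processed one at a time.
        fatalError("Body elided in this sketch")
    }
}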

This design is analogous to ChatSession in MLX Swift LM, and for the same reason: it places the concurrency guarantee at the infrastructure level rather than the application level, reducing the surface area for data race bugs.
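
The serialization guarantee can be observed with a toy example. `ToyModel` and `ToyContainer` below are invented for this sketch and contain no MLX code; the point is only that 100 concurrent callers cannot corrupt state owned by the actor.

```swift
/// A toy stand-in for a model whose internal state is not safe to mutate
/// from several threads at once. The name and behavior are hypothetical.
final class ToyModel {
    private(set) var callCount = 0
    func embed(_ text: String) -> [Float] {
        callCount += 1
        return [Float(text.count)]
    }
}

/// The actor owns the model, so every call to `encode` is serialized.
actor ToyContainer {
    private let model = ToyModel()

    func encode(_ texts: [String]) -> [[Float]] {
        texts.map { model.embed($0) }
    }

    var calls: Int { model.callCount }
}

let toyContainer = ToyContainer()

// Launch 100 concurrent callers; the actor processes them one at a time.
await withTaskGroup(of: Void.self) { group in
    for i in 0..<100 {
        group.addTask { _ = await toyContainer.encode(["text \(i)"]) }
    }
}
let totalCalls = await toyContainer.calls
// totalCalls is 100: no increments were lost
```

Without the actor, the unsynchronized `callCount += 1` would be a data race under the same load.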

6. What loadModelContainer Actually Does

The entry point for using MLXEmbedders in application code is loadModelContainer. A single call conceals a substantial orchestration sequence:

let container = try await loadModelContainer(
    configuration: ModelConfiguration(id: "mlx-community/nomic-embed-text-v1.5")
)

Behind this one line, the framework performs the following:

  1. Download: if the model files are not already cached locally, they are fetched from the Hugging Face Hub. Only the files required for inference are downloaded: weight shards, config.json, tokenizer files.
  2. Parse: config.json is deserialized to extract the model architecture, hidden dimensions, layer count, and pooling configuration.
  3. Resolve: the model_type field is matched against ModelTypeRegistry to identify the correct Swift model class.
  4. Load weights: .safetensors files are mapped directly into unified memory. Because unified memory is shared between the CPU and GPU, the loaded tensors are immediately addressable by Metal compute kernels without any copy.
  5. Initialize tokenizer: the vocabulary and merge rules are loaded from tokenizer.json and tokenizer_config.json.
  6. Configure pooling: the pooling strategy (CLS or mean) is extracted from the model configuration and wired into the inference path.
  7. Wrap in actor: the assembled model and tokenizer are placed inside a ModelContainer actor and returned to the caller.
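
The Parse step can be sketched with Codable. The struct below decodes a hypothetical subset of a BERT-style config.json; the JSON values are invented, real files contain many more keys, and key names vary by architecture.

```swift
import Foundation

/// A hypothetical subset of the fields found in a BERT-style config.json.
struct ToyModelConfig: Decodable {
    let modelType: String
    let hiddenSize: Int
    let numHiddenLayers: Int

    // Map the snake_case JSON keys onto Swift property names.
    enum CodingKeys: String, CodingKey {
        case modelType = "model_type"
        case hiddenSize = "hidden_size"
        case numHiddenLayers = "num_hidden_layers"
    }
}

// Invented example content, not copied from a real model repository.
let configJSON = """
{ "model_type": "bert", "hidden_size": 384, "num_hidden_layers": 6 }
"""

let config = try JSONDecoder().decode(ToyModelConfig.self, from: Data(configJSON.utf8))
// config.modelType is the string a registry lookup would use in the Resolve step.
```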

This is not a simple file load. It is full model orchestration — equivalent to what a Python SentenceTransformers pipeline does on the server side, running entirely on-device in a single Swift call.

7. Reference Implementation

The following implementation demonstrates a complete semantic search feature: model loading, embedding a corpus of documents, embedding a user query, and returning ranked results by cosine similarity.

7.1 Package Dependency

Add mlx-swift-lm to your project via Swift Package Manager:

https://github.com/ml-explore/mlx-swift-lm

Select the MLXEmbedders library target.

7.2 Loading the Model

import Combine  // ObservableObject and @Published
import MLXEmbedders

@MainActor
class EmbeddingService: ObservableObject {
    @Published var isLoading = false
    private var container: ModelContainer?

    func loadModel() async {
        isLoading = true
        defer { isLoading = false }

        do {
            container = try await loadModelContainer(
                configuration: ModelConfiguration(
                    id: "mlx-community/nomic-embed-text-v1.5"
                )
            )
        } catch {
            print("Model load failed: \(error)")
        }
    }
}

The ModelConfiguration type accepts both Hugging Face Hub identifiers (for remote download) and local directory paths (for bundled or pre-downloaded models). The download is cached after the first run; subsequent launches skip the network entirely and only pay the cost of loading the weights from disk.

7.3 Embedding a Corpus

enum EmbeddingError: Error {
    case modelNotLoaded
}

// Method on EmbeddingService
func embedDocuments(_ documents: [String]) async throws -> [[Float]] {
    guard let container else { throw EmbeddingError.modelNotLoaded }

    return try await container.encode(documents)
}

encode(_:) accepts a batch of strings and returns a 2D array: one embedding vector per input string. Processing all inputs in a single call is essential for performance — it allows MLX to pad sequences to a uniform length, process them as a single tensor batch, and maximize GPU utilization. Calling encode in a loop for individual strings serializes what could be parallel work and typically reduces throughput by 5–10×.

With the corpus embedded, a search embeds the query and ranks every document by its similarity to that query vector:

struct SearchResult {
    let document: String
    let score: Float
}

func search(query: String, documents: [String], documentEmbeddings: [[Float]]) async throws -> [SearchResult] {
    guard let container else { throw EmbeddingError.modelNotLoaded }

    let queryEmbedding = try await container.encode([query])[0]

    return zip(documents, documentEmbeddings)
        .map { doc, embedding in
            SearchResult(
                document: doc,
                score: cosineSimilarity(queryEmbedding, embedding)
            )
        }
        .sorted { $0.score > $1.score }
}

private func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map(*).reduce(0, +) // valid for L2-normalized vectors
}

In a production deployment, the document embeddings are pre-computed and persisted — typically in a vector database or a serialized .safetensors file. At query time, only the query needs to be embedded, making the latency of a semantic search query comparable to a simple index lookup.
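
Most interfaces only need the best few matches rather than the full ranking. The sketch below wraps pre-computed vectors in a small in-memory index and returns the top k. `ToyVectorIndex` is an illustrative type, not part of MLXEmbedders, and the 2-dimensional vectors are invented and assumed to be L2-normalized.

```swift
/// A minimal in-memory index over pre-computed, L2-normalized embeddings.
struct ToyVectorIndex {
    let documents: [String]
    let embeddings: [[Float]]

    private func dot(_ a: [Float], _ b: [Float]) -> Float {
        zip(a, b).map(*).reduce(0, +)
    }

    /// Returns the `k` documents with the highest dot product against `query`.
    func topK(_ k: Int, query: [Float]) -> [(document: String, score: Float)] {
        let scored = zip(documents, embeddings).map { document, vector in
            (document: document, score: dot(query, vector))
        }
        return Array(scored.sorted { $0.score > $1.score }.prefix(k))
    }
}

let index = ToyVectorIndex(
    documents: ["Mount Etna trail guide", "Quarterly tax checklist", "Crater rim walk"],
    embeddings: [[1, 0], [0, 1], [0.8, 0.6]]
)
let hits = index.topK(2, query: [1, 0])
// hits lists "Mount Etna trail guide" first, then "Crater rim walk"
```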

8. Performance Considerations

8.1 Batching is Mandatory for Throughput

The single most impactful performance decision in an embedding pipeline is batching. Each call to encode involves dispatching Metal compute kernels, padding sequences, and transferring results. The fixed overhead of this dispatch is largely independent of batch size. Amortizing it over 32 or 64 inputs rather than 1 yields substantially higher throughput, because the overhead is paid once per batch instead of once per input.

For an application that needs to embed a knowledge base of 10,000 documents, the difference between single-item calls and batched calls of size 64 is typically the difference between minutes and seconds.

// Preferred: embed all documents in one call
let embeddings = try await container.encode(documents)

// Avoid: serial single-item calls
var slowEmbeddings: [[Float]] = []
for document in documents {
    slowEmbeddings.append(try await container.encode([document])[0])
}
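
For a very large corpus, a single call may pad every input to the longest document and hold all activations in memory at once, so a middle path is to feed fixed-size batches. The helper below is a sketch under that assumption: `batches` and `embedCorpus` are illustrative names, and `encode` is a placeholder closure standing in for whatever batched embedding call the application uses.

```swift
/// Splits `items` into consecutive batches of at most `size` elements.
func batches<T>(of items: [T], size: Int) -> [[T]] {
    precondition(size > 0, "Batch size must be positive")
    return stride(from: 0, to: items.count, by: size).map { start in
        Array(items[start..<min(start + size, items.count)])
    }
}

/// Embeds a corpus batch by batch, preserving document order.
func embedCorpus(
    _ documents: [String],
    batchSize: Int = 64,
    encode: ([String]) async throws -> [[Float]]
) async rethrows -> [[Float]] {
    var all: [[Float]] = []
    all.reserveCapacity(documents.count)
    for batch in batches(of: documents, size: batchSize) {
        all.append(contentsOf: try await encode(batch))
    }
    return all
}

// Usage with a fake encoder that returns each string's length as a 1-D vector.
let corpus = (0..<150).map { "doc \($0)" }
let vectors = await embedCorpus(corpus) { batch in batch.map { [Float($0.count)] } }
// 150 documents are processed as batches of 64, 64, and 22
```

The best batch size depends on document length and available memory, so it is worth measuring on the target device.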

8.2 Model Size vs. Quality Trade-offs

Not all embedding models are the same. The relevant dimensions for mobile deployment are:

Model                   Parameters   Output Dim   Size      Quality
all-MiniLM-L6-v2        22M          384          ~85 MB    Good for general use
nomic-embed-text-v1.5   137M         768          ~290 MB   High quality, recommended
e5-large-v2             335M         1024         ~670 MB   Research-grade quality

For most mobile deployments, nomic-embed-text-v1.5 is the right balance: strong semantic quality, reasonable size, and full support in the MLX community ecosystem.
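
A back-of-the-envelope estimate of weight size is parameters × bits per parameter ÷ 8. The helper below is a rough sketch with an illustrative name; it ignores tokenizer files, quantization scales, and runtime buffers, so treat its output as a lower bound rather than a measured download size.

```swift
/// Approximate weight size in megabytes (1 MB = 1,000,000 bytes).
func estimatedWeightMegabytes(parameters: Double, bitsPerParameter: Double) -> Double {
    parameters * bitsPerParameter / 8 / 1_000_000
}

// A 137M-parameter model at two precisions.
let sixteenBit = estimatedWeightMegabytes(parameters: 137_000_000, bitsPerParameter: 16)  // 274.0
let fourBit = estimatedWeightMegabytes(parameters: 137_000_000, bitsPerParameter: 4)      // 68.5
```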

8.3 Embedding Persistence

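Pre-computed embeddings should be persisted across launches. Serializing [[Float]] to a file or a local SQLite database is straightforward and avoids re-encoding the same corpus on every startup. The example below uses JSON for brevity; a binary format is more compact and faster to load for large corpora.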

func save(embeddings: [[Float]], to url: URL) throws {
    let data = try JSONEncoder().encode(embeddings)
    try data.write(to: url)
}

func load(from url: URL) throws -> [[Float]] {
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode([[Float]].self, from: data)
}
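
A binary variant is sketched below for comparison. The file layout (two UInt32 header fields followed by raw Float32 values in host byte order) is an ad-hoc choice made for this example, not a standard format, and the function names are illustrative.

```swift
import Foundation

/// Writes embeddings as raw Float32 values preceded by a two-field UInt32
/// header: row count, then dimension. Uses the host's byte order.
func saveBinary(embeddings: [[Float]], to url: URL) throws {
    let dimension = embeddings.first?.count ?? 0
    precondition(embeddings.allSatisfy { $0.count == dimension }, "Rows must share one dimension")
    var data = Data()
    let header: [UInt32] = [UInt32(embeddings.count), UInt32(dimension)]
    header.withUnsafeBytes { data.append(contentsOf: $0) }
    for row in embeddings {
        row.withUnsafeBytes { data.append(contentsOf: $0) }
    }
    try data.write(to: url, options: .atomic)
}

/// Reads a file written by `saveBinary`, validating its size against the header.
func loadBinary(from url: URL) throws -> [[Float]] {
    let data = try Data(contentsOf: url)
    let headerSize = 2 * MemoryLayout<UInt32>.size
    guard data.count >= headerSize else { throw CocoaError(.fileReadCorruptFile) }

    var header = [UInt32](repeating: 0, count: 2)
    _ = header.withUnsafeMutableBufferPointer { data.copyBytes(to: $0, from: 0..<headerSize) }
    let rows = Int(header[0])
    let dimension = Int(header[1])

    let payloadSize = rows * dimension * MemoryLayout<Float>.size
    guard data.count == headerSize + payloadSize else { throw CocoaError(.fileReadCorruptFile) }
    guard payloadSize > 0 else { return Array(repeating: [], count: rows) }

    var flat = [Float](repeating: 0, count: rows * dimension)
    _ = flat.withUnsafeMutableBufferPointer { data.copyBytes(to: $0, from: headerSize..<data.count) }
    // Slice the flat buffer back into one array per row.
    return (0..<rows).map { Array(flat[($0 * dimension)..<(($0 + 1) * dimension)]) }
}

// Round trip through a temporary file.
let fileURL = FileManager.default.temporaryDirectory.appendingPathComponent("embeddings.bin")
try saveBinary(embeddings: [[0.1, 0.2], [0.3, 0.4]], to: fileURL)
let restored = try loadBinary(from: fileURL)
```

Because the floats are copied byte for byte, the restored values are bit-identical to the originals, and each value costs exactly 4 bytes on disk.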

For larger corpora, a purpose-built vector store (even a simple in-memory HNSW index) will outperform linear scan over stored vectors, but for collections under ~50,000 items, brute-force cosine comparison over pre-loaded vectors is fast enough for interactive use.

9. Practical Use Cases

Embeddings are not an experimental capability. They are the foundational primitive behind a family of features that are standard in production applications.

Semantic search. The canonical use case. Index a knowledge base, product catalogue, or document collection as embedding vectors. At query time, embed the user’s query and return the nearest neighbours by cosine similarity. Unlike BM25 or TF-IDF keyword search, this approach handles synonyms and paraphrasing naturally and, with a multilingual model, cross-lingual queries as well.

Retrieval-Augmented Generation (RAG). Before calling an LLM (via MLX Swift LM), retrieve the top-k semantically relevant documents and include them in the prompt context. This grounds the model’s responses in specific, current information without fine-tuning. The embedding stage is the retrieval mechanism.
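
The hand-off from retrieval to generation is plain string assembly. The sketch below shows one possible prompt template; the wording of the template and the function name are arbitrary choices for illustration, and the retrieved passages are assumed to come from a top-k similarity search.

```swift
/// Builds a prompt that places retrieved passages ahead of the question.
func buildPrompt(question: String, retrieved: [String]) -> String {
    // Number each passage so the model (and the reader) can refer to it.
    let context = retrieved.enumerated()
        .map { "[\($0.offset + 1)] \($0.element)" }
        .joined(separator: "\n")
    return """
        Answer the question using only the context below.

        Context:
        \(context)

        Question: \(question)
        """
}

let prompt = buildPrompt(
    question: "Which trail reaches the crater?",
    retrieved: ["The rim walk ends at the crater.", "The forest loop stays below the tree line."]
)
```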

Recommendation systems. Embed user interaction history and item descriptions in the same vector space. Items whose embeddings are close to a user’s embedding history are likely to be relevant. This approach scales to millions of items and works with any modality that can be embedded.

Clustering and topic modelling. Apply k-means or hierarchical clustering to a set of embeddings to discover structure in unstructured text data — support tickets, user feedback, app reviews.

Duplicate detection. Two items with a cosine similarity above a threshold (e.g., 0.95) are likely semantically equivalent, even if their surface text differs. Useful for deduplicating knowledge bases or detecting near-duplicate user submissions.
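
A brute-force version of this check compares every pair once. The sketch below assumes the vectors are already L2-normalized, uses invented 2-dimensional values, and runs in O(n²) time, so it suits small collections only.

```swift
/// Returns index pairs (i, j) with i < j whose dot product meets `threshold`.
/// Assumes L2-normalized vectors, so the dot product equals cosine similarity.
func nearDuplicatePairs(in embeddings: [[Float]], threshold: Float = 0.95) -> [(Int, Int)] {
    var pairs: [(Int, Int)] = []
    for i in embeddings.indices {
        for j in embeddings.indices where j > i {
            let score = zip(embeddings[i], embeddings[j]).map(*).reduce(0, +)
            if score >= threshold { pairs.append((i, j)) }
        }
    }
    return pairs
}

let duplicates = nearDuplicatePairs(in: [[1, 0], [0.6, 0.8], [0.8, 0.6], [0, 1]])
// Only items 1 and 2 score above 0.95 (their dot product is about 0.96)
```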

Classification. Train a lightweight classifier (logistic regression, k-NN) on top of embedding vectors. With a small labelled dataset, this can achieve strong classification performance without fine-tuning the embedding model itself.
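
As an illustration of the k-NN option, the sketch below labels a query by majority vote among its k most similar examples. The types, labels, and 2-dimensional vectors are invented, and the vectors are assumed to be L2-normalized.

```swift
/// A labelled training example: an embedding plus its class label.
struct LabelledExample {
    let vector: [Float]
    let label: String
}

func dot(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map(*).reduce(0, +)
}

/// Classifies `query` by majority vote among its `k` most similar examples.
/// Ties are not broken deterministically, so prefer an odd `k` for two classes.
func classify(_ query: [Float], using examples: [LabelledExample], k: Int) -> String? {
    let neighbours = examples
        .sorted { dot(query, $0.vector) > dot(query, $1.vector) }
        .prefix(k)
    var votes: [String: Int] = [:]
    for neighbour in neighbours {
        votes[neighbour.label, default: 0] += 1
    }
    return votes.max { $0.value < $1.value }?.key
}

let examples = [
    LabelledExample(vector: [1, 0], label: "bug"),
    LabelledExample(vector: [0.8, 0.6], label: "bug"),
    LabelledExample(vector: [0, 1], label: "billing"),
    LabelledExample(vector: [0.6, 0.8], label: "billing"),
]
let predicted = classify([0.96, 0.28], using: examples, k: 3)
// The three nearest neighbours are two "bug" examples and one "billing" example
```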

10. Common Pitfalls

Wrong pooling strategy. Already covered in detail, but worth repeating: this is the most common and most damaging error. Verify against the model card before shipping.

Missing attention mask in mean pooling. Mean pooling averages the hidden states of all tokens — but padding tokens have no semantic content and must be excluded from the average. Failing to mask padding tokens dilutes the embedding with noise, degrading quality on variable-length inputs.

No normalization. Comparing un-normalized vectors with dot product produces results that are sensitive to vector magnitude rather than direction. Always normalize, or use a cosine similarity function that handles normalization internally.

One-by-one encoding. As discussed in Section 8.1, serial single-item calls forfeit all hardware parallelism. Always batch.

Ignoring model size constraints. A 670 MB embedding model loaded alongside a 4 GB LLM on a device with 8 GB of unified memory leaves minimal headroom for the OS, the application runtime, and the KV cache. Test on the lowest-spec hardware in your target device range, not on a Mac Studio.

Mixing embedding spaces. Vectors from different models are not comparable. A similarity score between a nomic-embed-text embedding and an all-MiniLM embedding is numerically meaningless. If you change models, all stored embeddings must be recomputed.

11. Conclusion

Embeddings are not magic, and they are not opaque. They are the output of a well-defined five-stage pipeline — tokenization, lookup, transformer encoding, pooling, normalization — and understanding each stage is what separates a developer who can use the library from one who can use it correctly.

MLXEmbedders packages this pipeline into a Swift-idiomatic, actor-based API that runs entirely on-device. The architecture is clean: a registry resolves model types, a factory orchestrates loading, and an actor enforces concurrency safety. The surface area exposed to application code is small: loadModelContainer, encode, and a cosine similarity function.

What you do with that surface area is the more interesting question. Semantic search, RAG, clustering, recommendation — these are not research prototypes. They are production features, and all of them are now achievable on a device in a user’s pocket, without a network connection, without sending data to a third-party API, and without a GPU farm.

The next article in this series will build a complete RAG pipeline on top of MLXEmbedders and MLX Swift LM — combining local embeddings and local generation into a single on-device application.


References

  1. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019. arxiv.org/abs/1908.10084

  2. Nussbaum, Z., Morris, J. X., Duderstadt, B., & Mulyar, A. (2024). Nomic Embed: Training a Reproducible Long Context Text Embedder. arxiv.org/abs/2402.01613

  3. Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., & Wei, F. (2022). Text Embeddings by Weakly-Supervised Contrastive Pre-training. arxiv.org/abs/2212.03533

  4. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arxiv.org/abs/2005.11401

  5. Apple Inc. MLX Swift LM — MLXEmbedders. GitHub, 2024. https://github.com/ml-explore/mlx-swift-lm

  6. Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI. arxiv.org/abs/1603.09320