Local LLMs on Apple Silicon, Part 1: From Compatibility to Your First Local Chat

Cloud APIs put a frontier model behind a single HTTPS call. That convenience is hard to beat, and for most production workloads it remains the right choice. But something has shifted over the last couple of years: the gap between “what a hosted model can do” and “what a model running on your laptop can do” has narrowed enough that local inference is no longer a curiosity. For developers, especially those of us building on Apple Silicon, it has become a serious option.

This is the first part of a series on running LLMs locally on Mac. Later parts will dig into Ollama, MLX Swift, and hands-on model experimentation. Here we’ll stay deliberately practical: why you might want to run models locally, how to figure out what your machine can actually handle, what quantization really means once you start browsing model catalogs, and finally a hands-on first run with LM Studio.

Part 1: Why Run AI Locally?

Cloud AI vs Local AI

The two approaches solve the same problem in fundamentally different ways.

Dimension	Cloud AI	Local AI
Where inference runs	Provider’s datacenter	Your own machine
Where prompts go	Sent over the network	Stay on disk and in RAM
Cost model	Per-token, recurring	One-time download, then free
Latency	Network round-trip + queueing	Local, bounded by hardware
Availability	Requires connectivity and uptime	Works offline
Model choice	Whatever the provider exposes	Anything that fits in memory
Top-end capability	Frontier models, very large context	Constrained by your hardware
Operational burden	None	You manage models, runtimes, updates

Neither column is strictly better. They’re optimized for different things.

What you gain by going local

A few advantages are obvious; others become clear only after you’ve worked with local models for a while.

Privacy and data ownership. Prompts, attachments, and intermediate outputs never leave the machine. For anything touching client data, internal code, or unreleased product work, this matters a lot.
Offline usage. Flights, trains, conference Wi-Fi, sandboxed environments. None of it stops you.
Lower latency for short prompts. No DNS, no TLS, no provider queue. On Apple Silicon, the first token often appears faster than a typical cloud call.
No recurring API costs. Once the weights are on disk, you can run a million prompts at no marginal cost beyond electricity.
Full control over the model. Pick the size, the quantization, the runtime, the sampling parameters. Swap models per task.
Freedom to experiment. Try fifteen models in an afternoon. Compare them on your own prompts. Break things without worrying about a bill.
A genuine developer playground. If you’re building features that use LLMs (summarization, classification, code assistance, on-device agents), having a model you can poke at directly shortens the feedback loop dramatically.

What you give up

Honesty is more useful than enthusiasm here.

Hardware requirements. Models need RAM, and a lot of it. The bigger the model, the more memory you need, and the more pressure you put on the rest of the system.
Inference speed scales with your machine. A base-model MacBook Air will run small models comfortably, but it won’t match an M-series Max chip on larger ones.
Large models can be impractical. A 70B-parameter model in full precision is out of reach for most laptops. Even quantized, it asks a lot.
Choosing the right quantization matters. The same model can run beautifully or terribly depending on which variant you pick. We’ll come back to this.

But before downloading a model, an important question comes up: can your machine actually run it?

Part 2: Discovering Can I Run AI

A common first mistake is downloading a model that sounds impressive, watching it swap to disk, and concluding that local AI “doesn’t work on Mac.” It works. You just picked the wrong model for the hardware.

canirun.ai is a small but genuinely useful tool for avoiding exactly that. You tell it what machine you have, and it tells you which models are realistic.

In practice, it helps you:

Verify whether a given model is feasible on your hardware
Filter the catalog so you only see models that fit
Get a realistic compatibility estimate rather than a marketing claim
Understand the memory footprint at different quantization levels
Explore what each model is actually good at

The site has two modes:

A simplified mode that’s friendly if you’re new to local models
An advanced mode that surfaces the more technical details: memory estimates, quantization variants, context window, and so on

It’s particularly well-suited to Apple Silicon users because unified memory changes the usual rules of thumb. On a discrete-GPU PC, VRAM is the bottleneck and you compare against that. On a Mac, the GPU shares system memory, so a 32 GB M2 Max can comfortably host models that would be awkward on a 12 GB discrete GPU. Tools that don’t account for this often underestimate what Apple Silicon can do.

A concrete example: MacBook Pro M2 Max, 32 GB

That’s a common, well-balanced developer machine. With 32 GB of unified memory, you can realistically expect:

3B to 8B models to run very comfortably, with fast inference and room for a large context window. This is the sweet spot for most everyday tasks.
13B models to run well in quantized form (Q4 / Q5), with decent speed.
30B to 34B models to be usable in aggressive quantizations (Q3 / Q4), with slower inference and tighter memory headroom.
70B models to be out of reach in practice. They fit only in heavily quantized form, leave little room for context, and inference is slow enough that the experience becomes frustrating.

Bigger isn’t automatically better. A well-chosen 8B model that runs fast and leaves headroom for a long context will beat a struggling 30B model for almost every interactive workflow.

The main interface

Can I Run AI, main interface

The main interface is deliberately uncluttered. You start by picking your machine. The site has presets for common Apple Silicon configurations as well as PC hardware. From there, every model in the catalog is scored against your setup, so the list is sorted by what’s actually likely to work for you.

You can filter by category (general-purpose, code, vision, small models for edge use, and so on), narrow by size, and quickly see which models are worth your time. The UX leans toward “approachable” rather than “exhaustive,” which suits the use case: you want to make a decision in a couple of minutes, not read a spec sheet.

Model details

Clicking into any model opens a detail view with the information you actually need before downloading:

Model size (parameter count)
Context window
Category (general, code, reasoning, multimodal, etc.)
An estimated performance level on your machine
Links to the model on HuggingFace
Links to the matching Ollama entry where available
Download counts and community signal
Release date

It’s the kind of consolidated view you’d otherwise piece together from three or four tabs.

Can I Run AI, Llama 3.1 8B model detail page

Take Llama 3.1 8B as an example. The detail page summarizes it as:

“Meta’s versatile 8B: great quality/speed ratio.”

Alongside that, you get the HuggingFace and Ollama links, around 7.3M downloads, 5.5K likes, and a 128K context window.

A few of those numbers deserve a translation:

An 8B model has roughly 8 billion parameters. In rough terms, that puts it in the “small but genuinely capable” tier: fast on modern Macs, strong enough for summarization, classification, drafting, and a lot of everyday developer tasks, but not at the level of frontier hosted models for hard reasoning.
A 128K context window means the model can attend to about 128,000 tokens at once, long enough to fit a sizable codebase chunk, a full document, or a long conversation. Worth noting: just because the window exists doesn’t mean using it is free. Larger contexts cost more memory and slow inference down, sometimes significantly.
Quality / speed tradeoff is the lens to keep in mind. An 8B model won’t beat a 70B model on the hardest prompts, but it will respond several times faster and leave room for the rest of your workflow. For interactive tools, that often wins.

Going Further: llmfit for Terminal-First Developers

Can I Run AI is great when you want a quick answer in a browser tab. Once you start caring about the why behind those answers (exact memory footprints, expected throughput, how a given quantization will behave on your specific chip), you eventually want a more technical lens. That’s where llmfit comes in.

It’s worth being upfront about what llmfit is and isn’t. Unlike Can I Run AI, it’s entirely terminal-based and prioritizes technical analysis over UI polish. There’s no friendly card layout, no marketing copy, no animated scoring. The interface is denser and less approachable, and that’s a deliberate tradeoff. In exchange, you get the kind of detail that’s genuinely useful when you’re past the “will it run?” stage and into the “how well, and why?” stage.

If you’re comfortable in a terminal and want deeper performance insights, llmfit pays off quickly. If you’re new to local AI, you’ll probably have a better time starting elsewhere and coming back to this once the questions get sharper.

llmfit, terminal-based model compatibility analysis

What it actually does

llmfit scores models along several dimensions rather than collapsing everything into a single percentage:

Model quality: how the model itself stacks up on community evaluations
Inference speed: expected throughput on your specific hardware
Hardware suitability: how well the model maps to your chip and memory
Context window support: whether long-context usage is realistic, not just nominally supported

It detects your environment automatically. On a typical developer setup, that includes:

NVIDIA GPUs
AMD GPUs
Intel Arc GPUs
Apple Silicon (M-series chips and their unified memory)
Available system RAM
CPU capabilities and instruction sets

The detection step is genuinely convenient: you don’t pick a preset, you just run the tool and it figures out what it’s looking at.

Hardware simulation

One feature stands out and is worth highlighting on its own: hardware simulation.

You can ask llmfit to evaluate models as if you were running on a different machine. Curious whether your colleague’s M4 Pro with 48 GB would handle a model that struggles on your M2 with 16 GB? Considering an upgrade and want to know what becomes realistic at 64 GB? Building a feature for a customer whose hardware you know? Simulate the target configuration and get a grounded answer before spending a cent.

For developers shipping LLM-powered features to users on a range of machines, this is unusually useful. Most tools assume you only care about the box you’re sitting at.

Real benchmarks, not just specs

llmfit also aggregates community benchmarks rather than relying purely on theoretical hardware capabilities. The numbers it surfaces include:

Tokens per second under real inference
Time to first token (how long before the model starts responding)
VRAM / unified memory usage during actual runs

Theoretical specs only get you so far. Two chips with the same memory bandwidth on paper can deliver very different inference throughput once you factor in the runtime, the quantization, the context length being used, and how the model is laid out. Real-world numbers from people running the same model on the same hardware cut through that noise. They tell you what to actually expect, not what’s nominally possible.

Runtime integrations

llmfit isn’t trying to be your inference engine. It’s an analysis layer that points at the runtimes you’re likely already using:

Ollama
llama.cpp
MLX (the natural fit on Apple Silicon)

That keeps it useful regardless of how you prefer to actually run models day-to-day.

Installing on macOS

The simplest path on a Mac is via Homebrew:

brew install llmfit

That’s it. If you’d rather build from source or follow the upstream instructions, the repository is at github.com/AlexsJones/llmfit.

Who is this for, honestly

Beginners will probably get more value out of graphical tools like LM Studio or Can I Run AI. The feedback loop is faster and the friction is lower.
Advanced developers, particularly those benchmarking models, picking quantizations carefully, or shipping local-AI features to varied hardware, will appreciate llmfit’s precision and terminal-first workflow.

Both kinds of tools have a place. Use whichever matches the question you’re actually trying to answer.

Quantization, in plain terms

Once you start browsing models, you’ll notice the same model listed several times with cryptic suffixes: FP16, Q8, Q6, Q4_K_M, and so on. These are quantizations: different compressed versions of the same underlying weights.

The intuition is straightforward. A model’s weights are numbers. By default they’re stored in relatively high precision (say, 16-bit floats). Quantization stores them in fewer bits (8, 6, 5, 4, sometimes even less), which shrinks the model on disk and in memory. The arithmetic during inference is adjusted accordingly.

This isn’t an MLX-specific idea. The same technique shows up across the ecosystem:

GGUF files (the format used by llama.cpp and Ollama)
Ollama model variants
MLX and MLX-community model repositories
llama.cpp itself
Various other runtimes and inference engines

The same model can therefore exist in several flavors:

Variant	Roughly what it means	Typical effect
FP16	Full half-precision weights	Largest, highest fidelity
Q8	8-bit quantization	Much smaller, quality very close to FP16
Q6	6-bit quantization	Smaller again, still high quality
Q5	5-bit quantization	Common middle ground
Q4	4-bit quantization	Aggressive, popular on consumer hardware
Q3 / Q2	Very aggressive quantization	Smallest, noticeable quality degradation

The trend is consistent:

The more aggressive the quantization, the less RAM the model uses.
Inference often gets faster, because the runtime moves fewer bytes around.
Quality degrades gradually: usually imperceptible at Q8 or Q6, often acceptable at Q4, increasingly visible below that.

There’s no universally “correct” choice. For an 8B model on a 32 GB Mac, a Q5 or Q6 variant is usually a sweet spot: enough headroom for a long context, fast enough to feel interactive, and quality you’d struggle to distinguish from the full-precision version on most tasks. For a 30B model on the same hardware, you’ll likely need Q4, and you accept the tradeoff because the alternative is not running the model at all.

The honest summary: quantization is what makes local LLMs practical on laptops. Pick the smallest variant that still feels good for your use case, and don’t be afraid to test a couple of options on real prompts before committing.

If you want a deeper, hands-on look at the topic, including how to actually convert and quantize a model yourself using MLX and Hugging Face, see Quantization in LLMs: How to Run AI on Your iPhone Without Burning It.

Part 3: Getting Started with LM Studio

Now that we have a clearer picture of what local AI asks of your hardware and how to figure out which models are realistic, it’s time to actually run one. The easiest place to start, by a comfortable margin, is LM Studio.

At the time of writing, LM Studio is at version 0.4.13 on macOS, with native support for Apple Silicon, including the full M-series lineup. There’s no separate Rosetta build to think about and no GPU drivers to install. You download the app, open it, and you’re roughly five minutes away from your first locally-run conversation.

The reason LM Studio is usually the first tool recommended to newcomers is its UX. Where most of the local-AI ecosystem grew up around terminals and config files, LM Studio offers a polished graphical interface that turns “download and run an LLM” into something close to “download and run any other Mac app.” The mental model is familiar even if the underlying technology isn’t.

If you want to go deeper, I also discussed LM Studio and running LLMs locally on Apple Silicon with Adrien Grondin in a recent podcast episode:

Under the hood, LM Studio pulls models from Hugging Face, the largest open ecosystem for machine learning and LLM distribution. We touched on Hugging Face briefly in the MLX Swift series, but LM Studio hides most of the friction: no Python environment, no manual file management, no hunting through repository directories. You search, you click, you download.

This is also where the tools from the previous sections start to pay off. Before downloading anything from Hugging Face, Can I Run AI or llmfit can tell you whether a given model is realistic on your specific Mac and roughly how much memory it will need. Five minutes of compatibility-checking up front saves a lot of “why is my fan at full speed and the model still loading” later.

Downloading a model

For this walkthrough, we’ll use a Kiwi 13B GGUF model.

LM Studio, downloading a Kiwi 13B GGUF model

The model name carries more information than it looks. Two parts are worth unpacking.

The “13B” refers to the model’s parameter count: roughly 13 billion. As a rough rule of thumb, larger parameter counts tend to improve reasoning, instruction-following, and text quality. They also require proportionally more RAM and compute. A 13B model is a noticeable step up from 7B/8B in capability, and a noticeable step up in resource usage. On a 32 GB Mac, it’s well within reach in quantized form; on 16 GB, it gets tight.

The “GGUF” is just as important. GGUF is a model file format designed specifically for local inference. It’s used by most of the tools you’re likely to encounter in this space:

llama.cpp
LM Studio
Ollama
various Apple Silicon-oriented runtimes

GGUF is optimized for local execution and pairs particularly well with quantized models. In practice, when you browse Hugging Face for a model to run on your Mac, you’ll almost always be looking at a GGUF file.

This is where quantization, the topic we covered earlier, stops being theory and starts mattering in a very concrete way. Instead of the full FP16 weights, GGUF models are typically distributed in several quantized variants:

Each variant trades a small amount of quality for a meaningful reduction in memory footprint and, often, faster inference. For most everyday use, the quality difference between FP16 and Q5/Q6 is hard to spot in real prompts, while the memory savings make the difference between “runs comfortably” and “doesn’t fit.”

For Apple Silicon users in particular, quantized GGUF models hit a useful sweet spot between:

Quality: close enough to the full-precision model that most users won’t notice in practice
Memory usage: small enough to leave headroom for context and the rest of the system
Inference speed: fast enough to feel interactive on M-series chips

For a 13B model on a 32 GB Mac, a Q4 or Q5 variant is a reasonable default. Start there; you can always re-download a less aggressive quantization later if you find the quality lacking on your prompts.

Starting a chat session

Once the download finishes, LM Studio gets out of your way.

You can jump straight into experimentation by clicking “Use in New Chat”.

LM Studio, launching a local chat session

That opens a local chat session running entirely on your machine. No cloud API. No subscription. No remote inference server. The prompt you type goes to a model loaded into your Mac’s unified memory and the response comes back from the same place. Nothing crosses the network. Not even telemetry, if you don’t want it to.

It’s worth pausing on what that actually means. You now have your own local AI assistant, sitting on-device, available offline, and bound only by your hardware. It won’t match a frontier hosted model on the hardest tasks, but for a wide range of practical work (drafting, summarizing, classifying, refactoring, exploring ideas), it’s genuinely useful.

What stands out, especially on modern Apple Silicon, is how unremarkable this has become. Running a 13B model locally with snappy first-token latency would have required workstation-class hardware just a few years ago. Today it runs on a laptop you’d happily carry to a coffee shop, on battery, without anyone noticing the fan. That shift is what makes the local-AI ecosystem worth paying attention to as a developer: not the novelty of any individual tool, but the fact that the floor of what’s reasonable on consumer hardware has moved so far up.

Chatting with the model

Once the chat window is open, the experience feels very close to a hosted assistant. Everything is just happening on your machine.

LM Studio, chat interface and model settings

In practice you can do the same things you’d reach for a cloud assistant for:

ask questions
generate code
summarize and rewrite text
iterate on prompts
probe reasoning
evaluate how the model behaves on your own use cases

What sets LM Studio apart from a thin chat wrapper is that it doesn’t hide the knobs most cloud products keep behind their API. The sampling parameters and system instructions that shape every response are surfaced directly in the sidebar.

Understanding model parameters

Modern LLMs aren’t deterministic by default. Their output depends on a handful of sampling parameters that control how the model picks the next token at each step. LM Studio exposes these controls inline.

LM Studio, sampling parameter settings

The three you’ll touch most often:

Temperature: scales the randomness of token selection.
Top-K sampling: restricts the choice to the K most probable tokens at each step.
Top-P sampling: restricts the choice to the smallest set of tokens whose cumulative probability exceeds P (also called nucleus sampling).

These get a deeper treatment in Chapter 6 of AI-Driven Swift Architecture, the book series co-authored with Dave Poirier.

For everyday use the pragmatic intuition is enough:

Lower temperature → more deterministic, predictable output. Good for code generation, classification, structured extraction.
Higher temperature → more variation and creativity. Good for brainstorming, creative writing, ideation.
Top-K caps the candidate pool at a fixed number of options. Smaller K means tighter, more conservative output.
Top-P does the same job dynamically based on probability mass, often producing more natural results than a hard top-K cap.

In practice, these parameters can change the personality of a model as much as the prompt does. The same model at temperature=0.2 and temperature=1.2 behaves like two different assistants. Spending a few minutes finding sane defaults for your use case pays off quickly: low temperature for code, higher temperature for ideation.

System prompts and model instructions

The other lever LM Studio exposes, and arguably the most useful one, is the system prompt.

LM Studio, configuring model instructions

A system prompt is a piece of text the model sees before the conversation starts. It sets the frame: the role, the tone, the constraints, the output format. The same model with two different system prompts will feel like two different tools.

A few examples worth trying:

“You are a senior iOS engineer. Answer in 2-3 sentences. Prefer Swift 6 idioms. No marketing language.”
“Only output Swift code, no prose. Include type annotations. Comment only where the intent isn’t obvious.”
“Explain like a teacher. Build intuition before introducing terminology.”

This is where local experimentation starts to feel genuinely valuable. You can iterate on system prompts as fast as you can type: no API rate limits, no metered tokens, no waiting on a cold endpoint. Combined with a quantized model running on Apple Silicon, you end up with a surprisingly productive environment for prototyping LLM-powered features before committing to a hosted backend.

What’s next

We’ve covered the why of local AI, the how of figuring out what your Mac can handle, the what behind quantization, and a first hands-on run with LM Studio. That’s enough to start exploring on your own.

In the next part, we’ll move from clicking buttons to driving a local model from the command line and from code, starting with Ollama, the most common bridge between “I have a model running” and “I’m using it in a real workflow.”

Part 1: Why Run AI Locally?#

Cloud AI vs Local AI#

What you gain by going local#

What you give up#

Part 2: Discovering Can I Run AI#

A concrete example: MacBook Pro M2 Max, 32 GB#

The main interface#

Model details#

Going Further: llmfit for Terminal-First Developers#

What it actually does#

Hardware simulation#

Real benchmarks, not just specs#

Runtime integrations#

Installing on macOS#

Who is this for, honestly#

Quantization, in plain terms#

Part 3: Getting Started with LM Studio#

Downloading a model#

Starting a chat session#

Chatting with the model#

Understanding model parameters#

System prompts and model instructions#

What’s next#