MLX on One surfaces

TL;DR: Concept paper. On-device inference is not a feature of One — it is a consent-architecture requirement, and MLX is the natural runtime to deliver it on Apple silicon. No implementation exists; this page exists to be argued with.

Status as of 2026-05-07: see body.

Relations

One — the platform whose surfaces MLX would power.
Kai — the orchestrator whose intent classification benefits most from on-device inference.
Nav — the privacy guardian whose rules engine should never require a network round-trip.
PCHP — the consent protocol that on-device inference structurally strengthens.
BYOA — MLX is the on-device half of BYOA; user-authorized cloud model choice is the other half.
Three-layer architecture — MLX lives at the platform layer, beneath PCHP, atop the OS.
iBrokerage — the first consumer surface where on-device inference would matter (Gmail → 60-second Aha Moment).
Apple frame — "What would Apple do?" on inference runtime on Apple silicon answers itself.

Status: concept only. No code. No benchmarks. No selected models. This page is a position paper, published early so that the developer community and Manish's own engineering team can argue with it before any line of code is written. Read "What this is not" before quoting this page.

Figure: MLX Runtime Topology (runtime topology diagram; editable source available).

Why this page exists

Most personal-AI products treat on-device inference as a performance optimization — something you do when the user's phone happens to have enough silicon, with a cloud fallback when it doesn't. Hussh inverts that frame.

For One, on-device inference is a consent-architecture requirement. If a user's data never leaves the device for the inference step, the consent surface for that operation collapses to a local-only event: no DAT issuance, no Transparency Log entry for an outbound network read, no third-party model provider in the chain of custody. That is the strongest possible posture under PCHP, because the cleanest way to honor "your data, your business" is for the data to never go anywhere.

This is not the same argument as "on-device is faster" or "on-device is cheaper." Speed and cost are nice. Consent integrity is the actual point.

The argument, in three steps

1. PCHP gets stronger as the data path gets shorter

Every PCHP handshake has a blast radius — the set of parties that touch user data during a single operation. A cloud-LLM call has a blast radius that includes the model provider, their infrastructure, their logging, and any sub-processors they declare. Even with airtight contracts and zero-retention claims, the blast radius is non-zero and the user has to trust it is what the provider says.

A local inference call on the user's own device has a blast radius of one: the device. The user already trusted the device when they bought it and unlocked it with their face. There is no second trust decision to make.

This is the structural reason on-device inference is not optional for One's most sensitive surfaces — it's the only configuration in which the consent semantics are simple enough to be honest.
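
The blast-radius comparison can be made concrete with a small sketch. This is an illustration only: the `Operation` type and the party labels are hypothetical names chosen to mirror the parties listed above, not part of PCHP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    parties: frozenset  # every party that touches user data during the operation


def blast_radius(op: Operation) -> int:
    """Number of distinct parties that touch user data in one operation."""
    return len(op.parties)


def extra_trust_decisions(op: Operation) -> frozenset:
    """Parties beyond the device that the user must additionally trust."""
    return op.parties - {"device"}


# Hypothetical party labels, for illustration only.
local_call = Operation("local-inference", frozenset({"device"}))
cloud_call = Operation(
    "cloud-llm",
    frozenset({
        "device",
        "model-provider",
        "provider-infrastructure",
        "provider-logging",
        "declared-sub-processors",
    }),
)

if __name__ == "__main__":
    for op in (local_call, cloud_call):
        print(op.name, blast_radius(op), sorted(extra_trust_decisions(op)))
```

The local call leaves the set of extra trust decisions empty, which is the sense in which the consent semantics stay simple.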

2. The Apple-silicon installed base is now the dominant personal compute surface for One's target user

One's target user runs an iPhone, often an iPad, often a Mac. M-series silicon (M1 through M5 as of 2026) and A-series silicon from A17 Pro forward all share a unified memory architecture and a Metal-native GPU. On the M5 generation, Apple added GPU Neural Accelerators that further improve LLM time-to-first-token and tokens-per-second; using them requires macOS 26.2 or later [Apple ML Research, 2025].

This is the fleet One is built for. It is not a fallback fleet. Designing an inference layer that treats Apple silicon as a second-class target is, for One, designing for the wrong product.

3. MLX is the runtime that maps Apple silicon's actual capabilities to a usable API

MLX is Apple's open-source array framework, first released in December 2023, featured in dedicated sessions at WWDC 2025, and now the runtime under projects including Ollama 0.19+ on Apple silicon [Ollama Blog, March 2026]. The properties that matter for One:

Unified memory by default. Operations run on CPU or GPU without explicit memory transfer. For a personal-data agent that interleaves preprocessing (Python/Swift code touching user records) with model calls, this removes a class of bugs and a class of latency.
Metal-native, Apple-silicon-tuned. Independent November 2025 benchmarking on M2 Ultra found MLX achieved the highest sustained generation throughput among local Apple-silicon frameworks compared (MLX, MLC-LLM, Ollama, llama.cpp, PyTorch MPS) [arXiv 2511.05502].
Swift + Python bindings. One's stack is Capacitor 8 + Next.js 16 on the cross-platform side and native Swift on iOS. MLX ships both Swift and Python APIs, so the native Swift side can call it directly. Core ML, by contrast, is model-format-locked and harder to round-trip with HuggingFace artifacts. llama.cpp is cross-platform but not Metal-first by design.
Multimodal-ready. Independent work on vllm-mlx shows content-based prefix caching for vision embeddings achieving up to 28× speedup on repeated image queries on Apple silicon [arXiv 2601.19139]. One's roadmap includes document and screen capture; this is the right substrate.
Distributed inference primitives. mx.distributed.all_sum() and friends allow multi-machine deployment over Ethernet/Thunderbolt — relevant for a Mac-anchored "home base" model serving an iPhone in the same household, with no public-internet hop.

None of this is unique-to-Hussh insight. It is, however, the substrate decision Hussh has not yet committed to in writing — and writing it down is the point of this page.

Surfaces of One where MLX would change the design

Listed in priority order for the consent-integrity argument, not for ease of implementation. This is a thinking artifact, not a roadmap.

Kai — intent classification and tool selection

Every Kai turn opens with an intent decision: is the user asking for finance, calendar, mail, or general chat; which tool should I route to; do I need any user data to answer. That decision is structurally local — it operates on the prompt and the immediate session state, not on the user's full data graph. Doing it on-device with a small MLX-hosted model means the routing decision never leaves the phone, even for users who later opt into a cloud BYOA model for the heavy generation step.

This is the cleanest place to start. The model can be small (a few hundred million parameters, 4-bit quantized), the latency budget is tight, and the consent argument is unambiguous.
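
A minimal sketch of the routing step, under loud assumptions: the keyword table is a placeholder standing in for the small on-device model, and `RoutingDecision`, `classify_intent`, and the route names are hypothetical. The point it shows is the shape of the decision: it takes only the prompt and returns a route plus a flag for whether user data is needed, with no network call anywhere in the path.

```python
import re
from dataclasses import dataclass

# Placeholder keyword table standing in for a small on-device model.
_KEYWORDS = {
    "finance": ("portfolio", "dividend", "stock", "fee", "balance"),
    "calendar": ("meeting", "schedule", "calendar", "tomorrow"),
    "mail": ("email", "inbox", "reply", "gmail"),
}


@dataclass(frozen=True)
class RoutingDecision:
    route: str             # "finance", "calendar", "mail", or "chat"
    needs_user_data: bool  # whether answering requires touching user records


def classify_intent(prompt: str) -> RoutingDecision:
    """Pick the route with the most keyword hits; fall back to general chat."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    scores = {route: len(tokens & set(words)) for route, words in _KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return RoutingDecision("chat", False)
    return RoutingDecision(best, True)


if __name__ == "__main__":
    print(classify_intent("What dividend did my portfolio pay?"))
    print(classify_intent("hello there"))
```

Swapping the keyword table for a real classifier would not change the function's signature, which is the property that keeps the routing decision local.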

Nav: gating what leaves the device

Nav decides what data leaves the device, when, to whom, and under which CRT/DAT. This is exactly the operation that must never depend on a network round-trip, because the network is what Nav is gating. An on-device MLX-hosted classifier or rules-engine assist runs on the same silicon Nav already runs on, with zero exfiltration risk by construction.

If any surface of One is non-negotiably on-device, it is this one.
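
As a sketch of what a network-free gate could look like: the `Grant` and `EgressRequest` types below are hypothetical stand-ins for whatever the CRT/DAT actually carries, and the deny-by-default rule is an assumption of this example rather than a documented Nav behavior. The decision is a pure function of local state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """Hypothetical stand-in for a consent grant (the page's CRT/DAT)."""
    recipient: str
    scopes: frozenset
    expires_at: float  # seconds since the epoch


@dataclass(frozen=True)
class EgressRequest:
    recipient: str
    scope: str


def may_leave_device(req: EgressRequest, grant: Optional[Grant], now: float) -> bool:
    """Deny by default; allow only under a matching, unexpired grant.

    Pure function of local inputs: no I/O and no network access.
    """
    if grant is None:
        return False
    return (
        grant.recipient == req.recipient
        and req.scope in grant.scopes
        and now < grant.expires_at
    )


if __name__ == "__main__":
    grant = Grant("advisor.example", frozenset({"holdings:read"}), expires_at=2_000.0)
    request = EgressRequest("advisor.example", "holdings:read")
    print(may_leave_device(request, grant, now=1_000.0))
    print(may_leave_device(request, None, now=1_000.0))
```

Passing `now` in explicitly keeps the gate deterministic and easy to test on-device.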

iBrokerage — the 60-second Aha Moment

The Aha Moment is: connect Gmail → financial picture surfaces in under sixty seconds. The hard step is parsing thousands of mail messages to extract financial signal (statements, trade confirmations, dividend notices, fee disclosures). Doing that parse with a cloud LLM means uploading every message — a maximum-blast-radius operation on maximally sensitive content, for a non-negotiable product gate.

Doing it with an MLX-hosted model on the user's own device means the messages never leave the device for the parse. The cloud-LLM step, if any, comes later — operating on already-extracted, already-consented structured data, not raw mail. This is the iBrokerage surface where MLX changes the consent calculus, not just the latency.
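
The two-stage split can be sketched as follows. Everything here is illustrative: the regular expression is a placeholder for the on-device model's extraction step, and `FinancialRecord`, `extract_locally`, and `outbound_payload` are hypothetical names. Amounts are kept as integer cents to avoid floating-point drift.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FinancialRecord:
    kind: str          # e.g. "dividend"
    amount_cents: int  # integer cents, never a float


# Placeholder pattern standing in for an on-device model's extraction step.
_DIVIDEND = re.compile(r"dividend of \$(\d+)\.(\d{2})", re.IGNORECASE)


def extract_locally(messages):
    """Stage 1, on device: read raw message bodies, emit structured records."""
    records = []
    for body in messages:
        for dollars, cents in _DIVIDEND.findall(body):
            records.append(FinancialRecord("dividend", int(dollars) * 100 + int(cents)))
    return records


def outbound_payload(records):
    """Stage 2 input: only structured fields; raw bodies are never included."""
    return [{"kind": r.kind, "amount_cents": r.amount_cents} for r in records]


if __name__ == "__main__":
    inbox = ["Your dividend of $12.50 was paid today.", "Lunch on Friday?"]
    print(outbound_payload(extract_locally(inbox)))
```

Any later cloud step would see only the output of `outbound_payload`, so unrelated mail such as the lunch message never enters the outbound data path.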

Private Vault — local embeddings

The Private Vault holds personal records encrypted with AES-256-GCM and gated by Secure Enclave biometrics. Building searchable embeddings of vault contents requires a model that touches the plaintext. That model must be local, and MLX is the proposed runtime for local embedding models on Apple silicon. No configuration that sends plaintext off the device satisfies the vault's threat model.
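
A toy sketch of local embedding search, to show where the plaintext is touched. The hashed bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, and `DIM`, `embed`, and `search` are hypothetical names; encryption and decryption of vault records are out of scope for the example.

```python
import hashlib
import math
import re

DIM = 256  # arbitrary illustrative dimension


def embed(text: str) -> list:
    """Toy hashed bag-of-words vector; a stand-in for a local embedding model."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def search(query: str, index: dict) -> str:
    """Return the record id whose unit vector is most similar to the query."""
    q = embed(query)
    return max(index, key=lambda rid: sum(a * b for a, b in zip(q, index[rid])))


if __name__ == "__main__":
    # Plaintext is read only here, on device, to build the index.
    index = {
        "tax": embed("2024 tax return filing"),
        "lease": embed("apartment lease agreement"),
    }
    print(search("tax return", index))
```

Both indexing and querying run in the same process as the plaintext, which is the property the vault's threat model asks for.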

Open questions worth arguing with

These are open. None has a Hussh answer yet.

Model size ceiling on the iPhone fleet. What is the largest model that runs at acceptable latency on iPhone 15 Pro / 16 / 16 Pro / 17 Pro? Where do we draw the device-class line, and what is the BYOA-cloud fallback for users below it?
Quantization tradeoffs. 4-bit vs 8-bit. NVFP4 vs Q4_K_M. What is the quality cliff for personal-finance reasoning specifically, where small numerical errors are unacceptable in a way they are not for chat?
On-device personalization vs. shipped weights. Do we ever fine-tune on-device? If yes, on what signal (Kai conversation history? user corrections of Kai?), and how do we audit it under PCHP — does the model itself become user data?
MLX vs Core ML for the Apple Intelligence-adjacent surfaces. Apple Intelligence has its own on-device models that ship with the OS. What is the right division of labor — let Apple's models handle generic NLP and reserve MLX for Hussh-specific tasks (consent classification, financial parsing)?
Distributed home-base inference. Is the Mac-as-inference-server, iPhone-as-client topology useful enough to design for, or a distraction?
The cloud BYOA contract. When the user explicitly chooses a hosted model key for a heavy generation step, what is the minimal local pre-processing layer (running on MLX) that strips PII and lets the user's chosen cloud model do generic reasoning on a sanitized prompt?
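
For the model-size and quantization questions, a back-of-the-envelope calculation helps frame the discussion. This sketch counts weight storage only, as parameters times bits per weight divided by eight; it ignores quantization scales, the KV cache, and activations, so real footprints are larger. The parameter count and memory budget in the example are hypothetical, not measured device limits.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate weight storage in bytes (weights only, rounded up)."""
    return (num_params * bits_per_weight + 7) // 8


def fits_budget(num_params: int, bits_per_weight: int, budget_bytes: int) -> bool:
    """True if the weights alone fit within the given memory budget."""
    return weight_bytes(num_params, bits_per_weight) <= budget_bytes


if __name__ == "__main__":
    small_router = 300_000_000   # hypothetical "few hundred million" parameter model
    budget = 200_000_000         # hypothetical budget in bytes, for illustration
    for bits in (4, 8):
        size_mb = weight_bytes(small_router, bits) / 1_000_000
        print(f"{bits}-bit: {size_mb:.0f} MB, fits={fits_budget(small_router, bits, budget)}")
```

Halving the bits per weight halves the weight footprint, which is why the quality cost of 4-bit versus 8-bit is the question that decides the device-class line.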

If you are reading this page and you have a strong opinion on any of the six, the page exists to receive that pushback. File it as an issue against this wiki page or send it to Manish directly.

What this is not

Not a shipping plan. No timeline, no surface commitment, no model selected.
Not a benchmark. No performance numbers from Hussh — only public benchmarks from third-party research, cited above.
Not a vendor commitment. Hussh is not "an MLX shop" in the same sense some startups are "a CUDA shop." MLX is the right runtime for the Apple-silicon target. Other targets (Linux servers running cloud BYOA, Android in some hypothetical future) require different runtimes. BYOA is the architectural commitment; MLX is the consequent runtime choice for one segment of the fleet.
Not a product page for One, Kai, Nav, or iBrokerage. Those have their own canonical pages. This page is about the inference runtime under them.
Not Apple-endorsed. Hussh has no relationship with Apple beyond being an Apple-silicon-targeting independent developer. The Apple-frame default ("What would Apple do?") is a design discipline, not a partnership.

How this page should be used

If you are an engineer at Hussh: this is the substrate you build against when an on-device-vs-cloud question arises. Cite this page in design docs; argue with it in PRs.

If you are an external developer: this is Hussh's public position on personal-agent inference architecture, published before the code so you can build adjacent work — Open Source MCP servers, BYOA cloud adapters, alternative on-device runtimes — that interoperates with where Hussh is going.

If you are an investor or partner: this is one of seven or eight architectural decisions that determine whether Hussh's "your data, your business" claim is structurally true or just marketing. Read it alongside PCHP, BYOA, and the three-layer architecture.

If you are Manish reading this back later: this page is concept-only. Do not let it harden into doctrine without code. Claims here become standing rules only after they have survived implementation contact.

Enterprise architecture status refresh

Current repo truth: MLX/on-device inference remains concept-only unless checked code and tests prove a shipped path.
North-star direction: local inference should operate on scoped local envelopes or projections only after consent.
Not shipped / not implied: MLX does not receive direct raw PKM and is unrelated to partner CRM storage authority.