Local AI on Apple Silicon — MLX & OpenCode

What Is an Enterprise Second Brain?

The term "second brain" was popularized by productivity thinkers as a way to describe an external system for capturing, organizing, and retrieving knowledge — freeing your biological brain from the burden of remembering everything. Tools like Notion, Obsidian, and Roam brought that concept to individuals.

An enterprise second brain takes this further. Instead of personal notes, it connects to the systems your organization already runs — project trackers, documentation wikis, incident management platforms, internal registries — and makes all of that institutional knowledge queryable through a single, natural language interface.

Think: one chat window that knows what your Jira sprints look like, what your internal docs say, what incidents are open, and what's in that report you just uploaded. No tab-switching. No copy-pasting across six tools. Just questions and answers, backed by your actual data.

That's the pattern this post explores — and how to build it entirely on-device using MLX and OpenCode on Apple Silicon.

There's a moment in every enterprise AI project where someone in the room asks: "But where does our data go?"

For teams working with sensitive internal data — project telemetry, incident histories, documentation, proprietary reports — that question isn't rhetorical. Sending it to an external API endpoint, even a well-governed one, introduces compliance surface area that many organizations simply can't accept.

Running the reasoning layer locally sidesteps that problem entirely. MLX and OpenCode make this practical today, on hardware your team already owns.

The Stack

The UI talks directly to the MLX inference server running on localhost — no intermediate gateway, no network hop. OpenCode drives the development loop, also fully local.

Why MLX?

MLX is Apple's open-source array framework for machine learning on Apple Silicon. It's not just "PyTorch for Mac" — it's designed from the ground up to exploit the unified memory architecture of M-series chips, where CPU and GPU share the same memory pool with no PCIe bottleneck.

What that means in practice:

Concern	Cloud API	MLX Local
Data residency	External endpoints	Never leaves the device
Latency	800ms–2s (network + queue)	200–600ms (on M3 Pro)
Cost at scale	Per-token billing	One-time hardware
Compliance surface	API TOS, data retention	No third-party processor; can run fully air-gapped
Model control	Black box	Full prompt/weight control

For a 7B or 8B parameter model quantized to 4-bit, you're looking at ~4–5 GB of memory usage — well within the 18–36 GB unified memory on modern MacBook Pros.
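As a rough check on that figure, the weight footprint can be estimated from the parameter count and the bits stored per weight. The sketch below assumes group-wise quantization with a 16-bit scale and a 16-bit bias per group of 64 weights. That layout is an assumption for illustration, not something stated above, and the estimate ignores the KV cache and activations, which add to the total at runtime.

```python
def quantized_weight_gb(n_params: float, bits: int = 4,
                        group_size: int = 64, overhead_bits: int = 32) -> float:
    """Approximate size of quantized weights in gigabytes (10**9 bytes).

    overhead_bits is the per-group metadata; the default of 32 assumes one
    fp16 scale plus one fp16 bias per group (an assumption, see lead-in).
    """
    bits_per_weight = bits + overhead_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for label, n in [("7B", 7e9), ("8B", 8e9)]:
        print(f"{label}: ~{quantized_weight_gb(n):.1f} GB of weights")
```

Under these assumptions the weights alone land just under the quoted range, which leaves room for the cache and runtime buffers.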

Why OpenCode?

OpenCode is a terminal-based AI coding agent — a local, open alternative to cloud-hosted coding copilots. It understands your project structure, reads and writes files, runs shell commands, and iterates on code autonomously based on natural language instructions.

For building a second brain platform, OpenCode handles:

Scaffolding new workspace modules (e.g., adding a new dashboard view or data connector)
Wiring UI components to the local MLX endpoint
Iterating on visualizations based on plain-English feedback like "show the ideal burndown line as a dashed overlay"
Generating deployment configs when you're ready to move to cloud

Critically, OpenCode can be pointed at a local MLX model endpoint. Your entire agentic coding loop never touches an external API.

Architecture Deep Dive

1. Start the MLX Inference Server

Set up a local OpenAI-compatible server using mlx-lm:

pip install mlx-lm

# Pull and quantize a model (one-time)
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
  --quantize --q-bits 4 --mlx-path ./models/mistral-7b-4bit

# Start the server
mlx_lm.server --model ./models/mistral-7b-4bit --port 8080

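You now have a local endpoint at http://localhost:8080/v1 that speaks the OpenAI chat completions API. Any OpenAI-compatible client library, in any language, can talk to it directly. One caveat: some mlx-lm releases appear to treat the model field of a request as a model path or Hugging Face repo to load rather than a free-form label. If a request using the label mistral-7b-local from the snippets below is rejected, set that field to the same path you passed to --model.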
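Before wiring up a UI, it can help to confirm the endpoint answers at all. The following is a minimal smoke test using only the Python standard library. It assumes the server from the command above is listening on port 8080 and reuses the model label that the later snippets use; the helper names build_chat_request and ask are made up for this sketch.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches the --port used above


def build_chat_request(prompt: str, model: str = "mistral-7b-local",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    try:
        print(ask("Reply with the single word: ready"))
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        print(f"Server not reachable at {BASE_URL}: {exc}")
```

If this prints a reply, every UI option in the next step should be able to reach the same URL.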

2. Connect Your UI Directly

The MLX server is OpenAI-compatible, which means you can call it directly from any frontend framework without a backend in between. Here are working snippets for the most common UI options:

Option A — React (with useState + fetch)

A minimal but complete chat component that sends the conversation to the local model and renders the reply. It waits for the full response rather than streaming:

import { useState } from "react";

const SYS = `You are an enterprise knowledge assistant. Help users find
information, summarize documents, and query internal data. Be concise.`;

export default function SecondBrain() {
  const [messages, setMessages] = useState([]);
  const [input, setInput]       = useState("");
  const [loading, setLoading]   = useState(false);

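  async function ask() {
    // Ignore empty input and avoid overlapping requests
    if (!input.trim() || loading) return;
    const userMsg = { role: "user", content: input };
    const history = [...messages, userMsg];
    setMessages(history);
    setInput("");
    setLoading(true);

    try {
      const res = await fetch("http://localhost:8080/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "mistral-7b-local",
          messages: [{ role: "system", content: SYS }, ...history],
          temperature: 0.3,
          max_tokens: 512,
        }),
      });
      if (!res.ok) throw new Error(`MLX server returned HTTP ${res.status}`);

      // Non-streaming: the full reply arrives in a single JSON body
      const data = await res.json();
      const reply = data.choices[0].message.content;
      setMessages([...history, { role: "assistant", content: reply }]);
    } catch (err) {
      console.error("Chat request failed:", err);
    } finally {
      // Always clear the loading indicator, even if the request failed
      setLoading(false);
    }
  }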

  return (
    <div style={{ maxWidth: 700, margin: "2rem auto", fontFamily: "sans-serif" }}>
      <h2>Second Brain</h2>
      <div style={{ border: "1px solid #ddd", borderRadius: 8, padding: 16, minHeight: 300 }}>
        {messages.map((m, i) => (
          <div key={i} style={{ marginBottom: 12,
            textAlign: m.role === "user" ? "right" : "left" }}>
            <span style={{
              background: m.role === "user" ? "#0070f3" : "#f0f0f0",
              color:      m.role === "user" ? "#fff"    : "#000",
              padding: "8px 12px", borderRadius: 16, display: "inline-block",
              maxWidth: "80%",
            }}>
              {m.content}
            </span>
          </div>
        ))}
        {loading && <p style={{ color: "#999" }}>Thinking…</p>}
      </div>
      <div style={{ display: "flex", gap: 8, marginTop: 12 }}>
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          onKeyDown={e => e.key === "Enter" && ask()}
          placeholder="Ask anything about your org…"
          style={{ flex: 1, padding: "10px 14px", borderRadius: 8,
            border: "1px solid #ddd", fontSize: 15 }}
        />
        <button onClick={ask}
          style={{ padding: "10px 20px", borderRadius: 8, background: "#0070f3",
            color: "#fff", border: "none", cursor: "pointer", fontSize: 15 }}>
          Send
        </button>
      </div>
    </div>
  );
}

Run it with:

npx create-react-app second-brain && cd second-brain
# Replace src/App.js with the component above
npm start

Option B — Streamlit (Python, fastest to prototype)

If your team is more comfortable in Python, Streamlit gives you a full chat UI in ~30 lines:

# app.py
import streamlit as st
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

SYS = """You are an enterprise knowledge assistant. Help users find
information, summarize documents, and query internal data. Be concise."""

st.title("🧠 Second Brain")

if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask anything about your org…"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        response = client.chat.completions.create(
            model="mistral-7b-local",
            messages=[{"role": "system", "content": SYS},
                      *st.session_state.messages],
            temperature=0.3,
            max_tokens=512,
            stream=True,
        )
        reply = st.write_stream(r.choices[0].delta.content or "" for r in response)

    st.session_state.messages.append({"role": "assistant", "content": reply})

Run it with:

pip install streamlit openai
streamlit run app.py

Streamlit handles state, streaming, and UI out of the box — no HTML or JavaScript required. Great for internal demos and analyst-facing tools.
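With stream=True, an OpenAI-compatible server sends the reply as server-sent events: lines of the form data: {json}, terminated by data: [DONE]. The openai client used above decodes these for you, so the sketch below is only meant to show what is on the wire. The function name iter_stream_text and the sample lines are illustrative, not captured from a real server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    # Hand-written sample in the chat.completion.chunk shape
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Sprint 14 "}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "is on track."}}]}',
        "",
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(sample)))
```

The same parsing logic would apply if you added streaming to the React or plain HTML options, where no client library does it for you.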

Option C — Vanilla HTML (zero dependencies)

If you want something you can open as a file in any browser with no build step:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Second Brain</title>
  <style>
    body { font-family: sans-serif; max-width: 680px; margin: 2rem auto; }
    #log  { border: 1px solid #ddd; border-radius: 8px; padding: 16px;
            min-height: 260px; margin-bottom: 12px; }
    .user { text-align: right; }
    .user span  { background: #0070f3; color: #fff;
                  padding: 8px 12px; border-radius: 16px; display: inline-block; }
    .bot span   { background: #f0f0f0;
                  padding: 8px 12px; border-radius: 16px; display: inline-block; }
    #row  { display: flex; gap: 8px; }
    input { flex: 1; padding: 10px 14px; border-radius: 8px;
            border: 1px solid #ddd; font-size: 15px; }
    button{ padding: 10px 20px; border-radius: 8px; background: #0070f3;
            color: #fff; border: none; cursor: pointer; font-size: 15px; }
  </style>
</head>
<body>
  <h2>Second Brain</h2>
  <div id="log"></div>
  <div id="row">
    <input id="inp" placeholder="Ask anything about your org…"
           onkeydown="if(event.key==='Enter') ask()" />
    <button onclick="ask()">Send</button>
  </div>

  <script>
    const SYS = `You are an enterprise knowledge assistant. Help users find
information, summarize documents, and query internal data. Be concise.`;
    const history = [];

    async function ask() {
      const inp = document.getElementById("inp");
      const q   = inp.value.trim();
      if (!q) return;
      inp.value = "";
      addMsg("user", q);
      history.push({ role: "user", content: q });

      const res  = await fetch("http://localhost:8080/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "mistral-7b-local",
          messages: [{ role: "system", content: SYS }, ...history],
          temperature: 0.3,
          max_tokens: 512,
        }),
      });
      const data  = await res.json();
      const reply = data.choices[0].message.content;
      history.push({ role: "assistant", content: reply });
      addMsg("bot", reply);
    }

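    function addMsg(cls, text) {
      const log  = document.getElementById("log");
      const row  = document.createElement("div");
      const span = document.createElement("span");
      row.className = cls;
      row.style.marginBottom = "12px";
      span.textContent = text;  // textContent keeps model output from being parsed as HTML
      row.appendChild(span);
      log.appendChild(row);
      log.scrollTop = log.scrollHeight;
    }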
  </script>
</body>
</html>

Open it with open index.html: no web server, no build step, no dependencies. The MLX server from step 1 still needs to be running.

3. Pointing OpenCode at the Local Model

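OpenCode supports custom model endpoints via its config file. The file name and schema have changed between OpenCode releases, so treat the snippet below as illustrative and check the configuration docs for the version you have installed: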

// ~/.config/opencode/config.json
{
  "model": {
    "provider": "openai",
    "name": "mistral-7b-local",
    "baseURL": "http://localhost:8080/v1",
    "apiKey": "not-needed"
  }
}

Now opencode in your terminal runs the full agentic loop — file reads, shell commands, code generation — through your local MLX model. No tokens sent anywhere. Use it to scaffold, debug, and iterate on whichever UI option you chose above.

Performance on Apple Silicon

Testing common second brain query types on an M3 Pro (18 GB unified memory) with Mistral-7B-Instruct-v0.3 at 4-bit quantization:

Query Type	Tokens Generated	Time to First Token	Total Latency
Sprint health summary	~180	140ms	1.8s
Documentation search	~240	155ms	2.3s
Incident list retrieval	~160	130ms	1.6s
App health / EOL check	~200	145ms	2.0s
File-based report query	~300	170ms	3.1s

For users running internal queries throughout the workday, this is completely acceptable — and often faster than waiting on a rate-limited cloud API under load.
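The table also implies a decode rate: tokens generated divided by the time spent after the first token arrives. The short calculation below derives it per row. The token counts in the table are approximate, so the results are estimates, and the function name is chosen for this sketch.

```python
# (query type, approx. tokens generated, time to first token in s, total latency in s)
ROWS = [
    ("Sprint health summary", 180, 0.140, 1.8),
    ("Documentation search", 240, 0.155, 2.3),
    ("Incident list retrieval", 160, 0.130, 1.6),
    ("App health / EOL check", 200, 0.145, 2.0),
    ("File-based report query", 300, 0.170, 3.1),
]


def decode_tokens_per_second(tokens: int, ttft_s: float, total_s: float) -> float:
    """Tokens generated per second once the first token has arrived."""
    decode_time = total_s - ttft_s
    if decode_time <= 0:
        raise ValueError("total latency must exceed time to first token")
    return tokens / decode_time


if __name__ == "__main__":
    for name, tokens, ttft, total in ROWS:
        rate = decode_tokens_per_second(tokens, ttft, total)
        print(f"{name:<26}{rate:6.0f} tok/s")
```

Running the same calculation on your own measurements is a quick way to compare machines or quantization settings.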

What You Give Up

Local inference isn't without tradeoffs. Be honest about the gaps:

Model capability. A 7B or 8B model is meaningfully less capable than frontier models on complex multi-step reasoning. For structured query and summarization patterns, this is generally fine. For nuanced synthesis across large document sets, you may feel the ceiling.

Context window. Most practical MLX-served models top out at 8K–32K tokens. If you're uploading large files or querying lengthy documentation exports, you'll need chunking strategies that a larger context window would sidestep.

Multimodality. No vision. If your workflow involves chart images or scanned PDFs, local-only isn't ready yet without additional tooling.

Cold start. Loading a 4-bit 7B model takes 5–10 seconds. Keep the server warm in any session where you expect repeated queries.
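The chunking strategies mentioned under the context window tradeoff can be as simple as overlapping windows over the text. Here is a minimal sketch that uses word count as a crude stand-in for tokens; the function name and default sizes are assumptions, and a real tokenizer would give tighter bounds.

```python
from typing import List


def chunk_words(text: str, max_words: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    report = " ".join(f"word{i}" for i in range(1000))
    pieces = chunk_words(report)
    print(f"{len(pieces)} chunks, first has {len(pieces[0].split())} words")
```

Each chunk can then be summarized or queried separately and the partial answers combined, at the cost of extra model calls.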

The Cloud Path vs. the Local Path

A production second brain typically targets managed cloud infrastructure: container orchestration, object storage for files, and a frontier LLM API as the inference backend. That gives you state-of-the-art model quality, elastic scale, and no hardware to manage.

The MLX path is the air-gapped tier: same UI, same use cases — but the entire reasoning layer runs on your laptop. It's the version you show to a security team or CISO who won't let data leave the building before a contract is signed.

Because the MLX server is OpenAI-compatible, switching to a cloud model later is mostly a configuration change: point the client at the new base URL (base_url in the OpenAI client, or the URL passed to fetch) and supply a real API key and model name. Build locally, promote to cloud when procurement catches up.
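One way to keep that switch out of the code entirely is to read the endpoint settings from the environment and default to the local server. The sketch below assumes three environment variable names (SECOND_BRAIN_BASE_URL, SECOND_BRAIN_API_KEY, SECOND_BRAIN_MODEL) that are invented for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    api_key: str
    model: str


def load_endpoint(env: Mapping[str, str] = os.environ) -> Endpoint:
    """Read endpoint settings, falling back to the local MLX server."""
    return Endpoint(
        base_url=env.get("SECOND_BRAIN_BASE_URL", "http://localhost:8080/v1"),
        api_key=env.get("SECOND_BRAIN_API_KEY", "not-needed"),
        model=env.get("SECOND_BRAIN_MODEL", "mistral-7b-local"),
    )


if __name__ == "__main__":
    cfg = load_endpoint()
    print(f"Using {cfg.model} at {cfg.base_url}")
```

In the Streamlit option, the client would then be created with OpenAI(base_url=cfg.base_url, api_key=cfg.api_key) and the model passed as cfg.model, so the same app runs against either backend.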

Closing Thoughts

The combination of MLX and OpenCode changes the calculus for enterprise AI pilots in regulated industries. You no longer have to choose between "powerful AI" and "data stays on-prem" — at least for query, retrieval, and summarization workflows at the 7B–13B parameter scale.

The local path gives you zero data egress, zero per-query cost, and a working demo you can spin up on any M-series Mac in under ten minutes. Pick the UI option that fits your team — React for a polished product, Streamlit for a quick analyst prototype, vanilla HTML for a zero-dependency proof of concept — and point it at the same local endpoint.

For any team sitting on sensitive internal data and waiting for vendor approval, this is the fastest way to prove value before the contract is signed.
