Karpathy's LLM OS (2023): the architecture blueprint we still use today
Read the field note below to see how we apply this pattern in real Claude Code projects.
The Karpathy Playbook #2: LLM OS
Part 2 of 6. From Software 2.0 to the March of Nines, tracing Karpathy's public thinking with FRE|Nxt Labs production commentary.
The tweets
"With many pieces dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System. E.g. today it orchestrates: Input & Output across modalities (text, audio, vision), Code interpreter, ability to write & run..."
@karpathy, 28 September 2023
Six weeks later, he made it concrete:
"LLM OS. Bear with me I'm still cooking. Specs: — LLM: OpenAI GPT-4 Turbo 256 core (batch size) processor @ 20Hz (tok/s) — RAM: 128Ktok — Filesystem: Ada002"
@karpathy, 10 November 2023
And later, at a Sequoia Capital event after leaving OpenAI: "Everyone is trying to build what I like to refer to as a LLM OS."
What we heard
If Software 2.0 (entry #1 in this series) redefined how software is written, LLM OS redefined what it looks like when it runs. The analogy is the most useful mental model we've ever found for explaining multi-agent systems to people who have never built one:
| OS concept | LLM equivalent |
|---|---|
| CPU | The LLM itself |
| RAM | Context window |
| Filesystem | Vector store / long-term memory |
| Processes | Agents |
| IPC | Tool calls and agent handoffs |
| Kernel | The orchestration layer (LangGraph) |
| User space | The application (your product) |
Most architecture decisions become obvious once the analogy clicks. You don't give every process unlimited RAM; you page memory in and out. You don't give every process root access; you scope permissions. You don't run the kernel blind; you instrument it and log every syscall.
These aren't AI-specific insights. They're OS design principles from the 1970s. We just forgot to apply them to agents.
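A minimal sketch of those three principles in plain Python, not tied to any framework. The `Kernel` and `Agent` classes, the `context_budget` field, and the tool names are hypothetical, introduced here only for illustration: the kernel owns the tools, checks each agent's allowlist, and logs every call whether or not it is permitted.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Agent:
    """A 'process': a name, a scoped tool allowlist, and a context budget."""
    name: str
    allowed_tools: frozenset[str]
    context_budget: int  # hypothetical unit: number of history items paged in

    def page_in(self, history: list[str]) -> list[str]:
        # Only the most recent items fit in "RAM"; the rest stays in storage.
        return history[max(0, len(history) - self.context_budget):]


@dataclass
class Kernel:
    """The orchestration layer: mediates and logs every tool call."""
    tools: dict[str, Callable[..., Any]]
    syscall_log: list[dict] = field(default_factory=list)

    def call(self, agent: Agent, tool_name: str, **kwargs: Any) -> Any:
        allowed = tool_name in agent.allowed_tools
        # Log before executing, so denied calls leave a trace too.
        self.syscall_log.append(
            {"agent": agent.name, "tool": tool_name, "args": kwargs, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"{agent.name} may not call {tool_name}")
        return self.tools[tool_name](**kwargs)


kernel = Kernel(tools={
    "search": lambda query: f"results for {query}",
    "delete": lambda path: f"deleted {path}",
})
formatter = Agent(name="formatter", allowed_tools=frozenset({"search"}), context_budget=2)
```

Here `kernel.call(formatter, "search", query="x")` succeeds, while `kernel.call(formatter, "delete", path="/tmp/a")` raises `PermissionError`; both end up in `kernel.syscall_log`.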
What we actually do with this
We use this framing explicitly in architecture reviews with clients. When a team shows us a prototype where every agent has full conversation history (unlimited RAM), shares a single tool registry (no permission scoping), and emits no traces (no kernel instrumentation), we explain it in OS terms:
"You've written a single-process program with global state and no logging. It works until it doesn't, and when it breaks you can't debug it."
Engineers who have never trained a model understand memory paging, process isolation, and system-call tracing. The LLM OS framing gives them vocabulary to reason about agent architecture without any ML background. That vocabulary is the actual point of the analogy.
Applied: what we built on InterviewLM
Using the LLM OS framing:
- CPU routing: Claude Opus handles complex reasoning (agent coordination, evaluation) and Claude Haiku handles simple tasks (formatting, extraction). Routing is dynamic, based on task type, and the model can switch mid-session.
- RAM management: Each interviewer persona agent receives only the context relevant to its current turn. The full transcript lives in the long-term store. Prompt caching keeps the "kernel" (system prompt) in cache, giving a 90%+ cache hit rate and sub-50ms latency for cached tokens.
- Filesystem: LangGraph checkpointing stores state at every node. Any failed session can be replayed from the last successful checkpoint.
- Process isolation: Each candidate session is an isolated LangGraph run. A failure in one session cannot affect another.
- Kernel logging: Every agent step emits a LangSmith trace. We can replay any production session exactly.
Result: 8 agents, 100+ concurrent sessions, 40% cost reduction vs. the prior architecture, sub-2s p99 latency.
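The CPU routing and RAM management items above could be sketched roughly as follows. This is an illustration, not the InterviewLM code: the function names, the `"opus"`/`"haiku"` tier labels, the transcript shape, and the "most recent turns for this persona" relevance rule are all assumptions.

```python
# Tier labels only, not real model identifiers; map them to model IDs in client code.
COMPLEX_TASKS = {"coordination", "evaluation"}
SIMPLE_TASKS = {"formatting", "extraction"}


def route_model(task_type: str) -> str:
    """Pick a model tier from the task type."""
    if task_type in COMPLEX_TASKS:
        return "opus"
    if task_type in SIMPLE_TASKS:
        return "haiku"
    # Assumption: unknown task types fail loudly instead of silently picking a tier.
    raise ValueError(f"no routing rule for task type: {task_type}")


def page_in_context(transcript: list[dict], persona: str, max_turns: int) -> list[dict]:
    """Return at most max_turns of the most recent turns involving this persona.

    The full transcript stays in the long-term store; only this slice
    goes into the agent's context window.
    """
    relevant = [turn for turn in transcript if turn["persona"] == persona]
    return relevant[max(0, len(relevant) - max_turns):]


transcript = [
    {"persona": "system_design", "text": "Walk me through your cache layer."},
    {"persona": "behavioural", "text": "Tell me about a conflict."},
    {"persona": "system_design", "text": "How does it fail over?"},
]
context = page_in_context(transcript, persona="system_design", max_turns=1)
```

Because routing is a pure function of the task type, the tier can change on every step of a session without any session-level state.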
The one thing to steal from this
Before you build your next agent system, write the OS spec. Fill in the table above for your architecture. If you can't fill every row, you're not ready to start. That table is the minimum viable architecture document.
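One way to make the "fill every row" check mechanical is a small spec object. This is a sketch: the `OSSpec` class and its `missing_rows` helper are hypothetical, and the example values are a loose reading of the InterviewLM description in this note, with the IPC row left blank on purpose to show a failing check.

```python
from dataclasses import dataclass, fields


@dataclass
class OSSpec:
    """One field per row of the OS-concept table."""
    cpu: str
    ram: str
    filesystem: str
    processes: str
    ipc: str
    kernel: str
    user_space: str

    def missing_rows(self) -> list[str]:
        # A row counts as missing if it is empty or whitespace only.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


spec = OSSpec(
    cpu="Claude Opus and Claude Haiku, routed by task type",
    ram="Per-turn context plus cached system prompt",
    filesystem="LangGraph checkpoints",
    processes="8 agents",
    ipc="",  # deliberately blank for the example
    kernel="LangGraph orchestration with LangSmith traces",
    user_space="InterviewLM",
)

if spec.missing_rows():
    print("Not ready to start. Missing rows:", spec.missing_rows())
```

Keeping the spec as data means it can live next to the code and be checked in review.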
Next in this series
#3. Eureka Labs (July 2024). After leaving OpenAI, Karpathy makes a specific bet: AI doesn't just change software, it changes how humans learn. What that bet tells us about his model of where this all goes.
Quick answers
What do I get from this cable?
You get a dated field note that explains how we apply this AI-industry pattern in real Claude Code projects.
How much time should I budget?
Typical effort is 7 min. The cable is marked intermediate.
How do I install the artifact?
This cable is guidance-only and does not ship an installable artifact.
How fresh is the guidance?
The cable is explicitly last verified on 2026-04-17, and includes source links for traceability.