How we built an AI-native technical assessment platform with 8 specialized agents, real-time adaptive difficulty, and comprehensive evaluation using LangGraph 1.0 and LangChain.
| Metric | Value | Notes |
|---|---|---|
| Agents deployed | 8 | Specialized agents |
| Concurrent sessions | 100+ | Supported |
| Response latency | <2s | p99 |
| Cost optimization | 40% | Via prompt caching |
| Evaluation accuracy | 4-dimension | Evidence-based scoring |
| Cost per session | $1.50 | Average |
InterviewLM is an AI-native technical assessment platform that evaluates candidates' ability to use modern AI coding tools (like Claude Code) in realistic development environments. This case study documents our implementation of a production-grade multi-agent system using LangGraph 1.0 and LangChain to power real-time AI-assisted technical interviews with adaptive difficulty and comprehensive evaluation.
Traditional technical interviews fail to assess how candidates work with AI coding assistants in realistic scenarios. InterviewLM needed to:
- Provide AI-assisted coding environments where candidates can use Claude Code naturally
- Adapt difficulty in real time based on candidate performance (Item Response Theory)
- Evaluate holistically across code quality, problem-solving, AI collaboration, and communication
- Scale cost-effectively for high-volume enterprise hiring
- **StateGraph Primitives:** Type-safe, reproducible state management
- **Native Checkpointing:** Conversation persistence and recovery
- **Conditional Routing:** Dynamic multi-agent orchestration
- **Streaming Support:** Real-time user experience
We implemented a hierarchical orchestrator-worker pattern with 8 specialized agents:
[Architecture diagram: the Session Orchestrator (Supervisor Agent) coordinates the Coding Agent (candidate-facing), the Interview Agent (background, IRT-based adaptive difficulty), the Evaluation Agent (post-session, 4-dimension evidence-based scoring), and the Question Generation Agent (dynamic), with candidate code executing in a Modal Sandbox Runtime for secure isolation.]
| Agent | Type | Purpose |
|---|---|---|
| Coding Agent | Candidate-facing | ReAct agent helping candidates write code with configurable helpfulness levels |
| Interview Agent | Background | IRT-based state machine tracking performance, adapting difficulty (hidden from candidates) |
| Evaluation Agent | Post-session | Evidence-based scoring across 4 dimensions with agentic data discovery |
| Fast Progression Agent | Real-time | Speed-optimized (20-40s) gate check for question advancement |
| Comprehensive Agent | Quality-focused | Deep evaluation (3-5 min) for hiring manager reports |
| Question Generation Agent | Dynamic | LLM-powered question variant generation with difficulty targeting |
| Question Evaluation Agent | Analysis | Per-question solution assessment |
| Supervisor Agent | Orchestrator | Coordinates handoffs between specialized agents |
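To make the orchestration concrete, here is a minimal sketch of how the supervisor-and-workers graph can be wired with LangGraph's StateGraph. The state fields, node bodies, and routing conditions are simplified placeholders rather than the production agents.

```python
# Minimal sketch of the orchestrator-worker wiring in LangGraph. Node bodies,
# state fields, and routing conditions are simplified placeholders.
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver  # Postgres checkpointer in production


class SessionState(TypedDict):
    """Shared interview-session state passed between agents (simplified)."""
    messages: list   # conversation so far
    theta: float     # current IRT ability estimate
    phase: str       # "coding", "evaluation", or "done"


def supervisor(state: SessionState) -> dict:
    # Decides which worker acts next; in this sketch it just passes state through.
    return {}


def coding_agent(state: SessionState) -> dict:
    # Candidate-facing ReAct agent would run here; the sketch simply advances the phase.
    return {"phase": "evaluation"}


def evaluation_agent(state: SessionState) -> dict:
    # Post-session, evidence-based scoring would run here.
    return {"phase": "done"}


def route(state: SessionState) -> str:
    """Conditional edge: pick the next worker based on the session phase."""
    if state["phase"] == "coding":
        return "coding_agent"
    if state["phase"] == "evaluation":
        return "evaluation_agent"
    return END


builder = StateGraph(SessionState)
builder.add_node("supervisor", supervisor)
builder.add_node("coding_agent", coding_agent)
builder.add_node("evaluation_agent", evaluation_agent)

builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route)
builder.add_edge("coding_agent", "supervisor")      # workers hand control back
builder.add_edge("evaluation_agent", "supervisor")  # supervisor routes to END when done

graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "session-123"}}
result = graph.invoke({"messages": [], "theta": 0.0, "phase": "coding"}, config)
```

Workers always return control to the supervisor, which is what lets the conditional edge act as the single routing point for all eight agents.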
We designed a composable middleware stack with 15 layers that intercept requests before and after model execution:
Request → Before Middleware (Prompt Caching → Model Selection → Turn Guidance) → Model Execution → After Middleware (State Extraction → Checkpointing → Persistence) → Response
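The snippet below illustrates the before/after composition pattern in plain Python rather than the actual LangChain middleware API; the Middleware, Request, and Response types and the ModelSelection layer are hypothetical stand-ins for the real 15-layer stack.

```python
# Illustrative sketch of the before/after middleware idea in plain Python.
# All type and class names here are hypothetical stand-ins.
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class Request:
    messages: list
    model: str = "claude-sonnet"
    metadata: dict = field(default_factory=dict)


@dataclass
class Response:
    text: str
    metadata: dict = field(default_factory=dict)


class Middleware(Protocol):
    def before(self, request: Request) -> Request: ...
    def after(self, response: Response) -> Response: ...


class ModelSelection:
    """Before-hook: route low-complexity turns to a cheaper model tier."""

    def before(self, request: Request) -> Request:
        if request.metadata.get("complexity") == "low":
            request.model = "claude-haiku"
        return request

    def after(self, response: Response) -> Response:
        return response


def run_with_middleware(stack: list[Middleware],
                        call_model: Callable[[Request], Response],
                        request: Request) -> Response:
    """Apply each 'before' hook in order, call the model, then apply the
    'after' hooks in reverse order (onion-style composition)."""
    for mw in stack:
        request = mw.before(request)
    response = call_model(request)
    for mw in reversed(stack):
        response = mw.after(response)
    return response


def fake_model(request: Request) -> Response:
    return Response(text=f"[{request.model}] ok")


print(run_with_middleware([ModelSelection()], fake_model,
                          Request(messages=[], metadata={"complexity": "low"})).text)
# -> "[claude-haiku] ok"
```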
We implemented Anthropic's prompt caching with three fixed breakpoints for predictable cache behavior:
| Tier | Content | Approx. Size | Cache Hit Rate |
|---|---|---|---|
| 1 | System prompt | ~15K tokens | 100% |
| 2 | Tool definitions | ~5K tokens | 100% |
| 3 | Message context | ~2K tokens | 100% |
Result: 40% reduction in token costs for long conversations
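As an illustration, the three breakpoints can be placed with the Anthropic Python SDK's cache_control blocks as sketched below; the prompt text, tool definition, and model id are placeholders.

```python
# Sketch of the three fixed cache breakpoints using the Anthropic Python SDK.
# Prompt text, tool definition, and model id are placeholders.
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are the InterviewLM coding assistant..."  # ~15K tokens in production

TOOLS = [
    {
        "name": "run_code",
        "description": "Execute candidate code in the sandbox.",
        "input_schema": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
        "cache_control": {"type": "ephemeral"},  # Tier 2: cache through the tool definitions
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # Tier 1: cache the system prompt
        }
    ],
    tools=TOOLS,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Here is my solution so far...",
                    "cache_control": {"type": "ephemeral"},  # Tier 3: cache the stable message prefix
                }
            ],
        }
    ],
)
print(response.usage)  # reports cache_creation_input_tokens / cache_read_input_tokens
```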
The Interview Agent implements psychometrically valid adaptive testing that converges to accurate ability estimates:
Each question result (correct or incorrect) feeds an IRT update that produces a new ability estimate θ and, from it, the next difficulty target:

P(correct) = 1 / (1 + e^(−(θ − b))), where θ is the candidate's ability estimate and b is the question difficulty.

Key parameters: θ ranges from −3 to +3, the estimate typically converges within 5-10 questions, and questions are targeted on a 1-10 difficulty scale.
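For illustration, here is a simple online update under the 1PL model above; the production Interview Agent's estimator is more involved, and the learning rate, clamping, and difficulty mapping are assumptions.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """1PL (Rasch) probability of a correct answer: P = 1 / (1 + e^-(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def update_theta(theta: float, b: float, correct: bool, lr: float = 0.5) -> float:
    """One gradient-ascent step on the response log-likelihood.
    The derivative of the log-likelihood w.r.t. theta is (outcome - P)."""
    outcome = 1.0 if correct else 0.0
    theta += lr * (outcome - p_correct(theta, b))
    return max(-3.0, min(3.0, theta))  # keep theta in the -3..+3 range


def next_difficulty(theta: float) -> float:
    """Map theta (-3..+3) onto the 1-10 difficulty scale, targeting ~50% success."""
    return round(1.0 + (theta + 3.0) * 9.0 / 6.0, 1)


theta = 0.0
for b, correct in [(0.0, True), (0.5, True), (1.0, False)]:  # difficulties in theta units
    theta = update_theta(theta, b, correct)
print(round(theta, 2), next_difficulty(theta))  # -> 0.34 6.0
```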
The Interview Agent must never leak evaluation data to candidates. We implemented 5 isolation layers:
1. Network Layer
2. API Filtering
3. Context Isolation
4. Tool Access
5. Audit Logging
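As a minimal sketch of the API Filtering, Context Isolation, and Audit Logging layers, the example below strips evaluation-only fields from session state before it reaches any candidate-facing channel; the field names are hypothetical, not the real schema.

```python
# Hypothetical illustration of the API Filtering / Context Isolation layers:
# strip evaluation-only fields before anything reaches the candidate-facing
# channel. Field names are assumptions.
CANDIDATE_SAFE_FIELDS = {"messages", "current_question", "time_remaining"}
EVALUATION_ONLY_FIELDS = {"theta", "difficulty_target", "rubric_scores", "evidence_log"}


def filter_for_candidate(session_state: dict) -> dict:
    """Return only fields the candidate may see; record any withheld evaluation data."""
    withheld = set(session_state) & EVALUATION_ONLY_FIELDS
    if withheld:
        # Audit Logging layer: note (but never return) evaluation fields.
        print(f"audit: withheld evaluation fields {sorted(withheld)}")
    return {k: v for k, v in session_state.items() if k in CANDIDATE_SAFE_FIELDS}


state = {"messages": [], "theta": 1.2, "rubric_scores": {"code_quality": 4}}
print(filter_for_candidate(state))  # -> {'messages': []}
```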
| Component | Choice | Rationale |
|---|---|---|
| Hosting | GCP Cloud Run | Auto-scaling, VPC integration, IAM auth |
| Database | Cloud SQL (PostgreSQL) | Checkpointer storage, state persistence |
| Caching | Memorystore (Redis) | Session state, metrics caching |
| Secrets | Secret Manager | API keys, credentials |
| Observability | LangSmith + Sentry | Tracing, error monitoring |
| Metric | Before | After | Improvement |
|---|---|---|---|
| Cold start latency | 8-12s | 2-3s | 70% faster |
| Token costs per session | $2.50 | $1.50 | 40% reduction |
| Max context utilization | 60% | 95% | Better conversations |
| Evaluation accuracy | N/A | 4-dimension scoring | Comprehensive |
- **Scalability:** 100+ concurrent interview sessions supported
- **Cost Efficiency:** $1.50 average cost per assessment
- **Candidate Experience:** Real-time AI assistance with streaming responses
- **Evaluation Quality:** Evidence-based scoring with bias detection
- **Deployment Velocity:** Infrastructure-as-code with CI/CD
- Use TypedDict state schemas for type safety
- Middleware for cross-cutting concerns
- Deterministic thread IDs (UUIDv5); see the sketch after this list
- State machine agents when no LLM is needed
- Parallel tool calls for latency reduction
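The sketch below shows the deterministic thread ID lesson in isolation: deriving the LangGraph thread_id with UUIDv5 so that re-running the same candidate and assessment resumes the same checkpointed session. The namespace string and identifiers are illustrative.

```python
# Sketch of deterministic thread IDs: the same candidate + assessment pair
# always maps to the same UUIDv5, so the LangGraph checkpointer can resume an
# interrupted session. The namespace string and identifiers are illustrative.
import uuid


def thread_id_for(candidate_id: str, assessment_id: str) -> str:
    return str(uuid.uuid5(uuid.NAMESPACE_URL,
                          f"interviewlm://{candidate_id}/{assessment_id}"))


config = {"configurable": {"thread_id": thread_id_for("cand-42", "assess-7")}}
# graph.invoke(initial_state, config)  # resumes from the last checkpoint if one exists
print(config["configurable"]["thread_id"])  # stable across re-runs
```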
- Prompt caching is transformative (40% savings)
- Model tiering by task complexity
- State machine agents = zero LLM cost
- Batched tool calls reduce round trips
| Layer | Technology |
|---|---|
| Agent Framework | LangGraph 1.0, LangChain |
| Language Models | Claude Sonnet 4.5, Claude Haiku 4.5 |
| Backend | Python 3.12, FastAPI |
| Frontend | Next.js 15, TypeScript, React 19 |
| Infrastructure | GCP Cloud Run, Terraform |
| Observability | LangSmith, Sentry |
Let's discuss how LangGraph and LangChain can power your next AI project.
Case Study Prepared by: Frenxt Consultancy
Date: January 2026 | Version: 1.0
*Based on a real production implementation.