Case Study

Building a 93% Accurate Grocery Classifier with LLM Agents

Multilingual Entity Resolution for Indian Grocery Items Across 12+ Languages

How we evolved a naive LLM call into a self-correcting multi-agent classification system that processes real grocery orders daily at $0.03 per batch of 255 items.

12 min read

April 2026

Household Management Platform

93% category accuracy · $0.03 cost per batch · 12+ languages supported · 723 taxonomy categories

Stack: LangGraph · LangChain · Python · GPT-5.4 Nano · PostgreSQL · OpenAI Structured Outputs

LLM-based grocery classification is the process of using large language models to categorize raw grocery text into a structured product taxonomy. Unlike traditional text classification, it requires multilingual entity resolution — mapping items written in Hindi, Tamil, English, Hinglish, or any of India's 20+ major languages to specific database entities with variant-level specificity.

Our client operates a household management platform serving Indian families. Users add grocery items to shopping carts via WhatsApp messages — and the items arrive as raw text:

"Amul Toned Milk Pouch 500ml" (English, branded)

"दूध 1L" ("milk 1L"; Hindi, Devanagari)

"aata 10kg whole wheat" (Hinglish, generic)

"Haldiram's Aloo Bhujia Namkeen 350g" (English, branded snack)

Each item needs classification into a 723-entry product taxonomy of Category > SubCategory > Type — from "Dairy, Bread and Eggs > Milk > Toned Milk" to "Masala, Pickle and Papad > Powdered Spices > Turmeric Powder" — then resolved to a database entity with brand, quantity, and unit specificity.

Architecture: Two-Tier Classification

We designed a two-tier system that optimizes for both cost and accuracy. The alias cache handles the 95%+ of items that have been seen before, while the LLM agent classifies genuinely novel items.

Raw grocery text in 12+ languages enters the pipeline:

Tier 1, Alias Cache: PostgreSQL pg_trgm fuzzy matching. 95%+ hit rate on a warm cache, at zero cost.

Tier 2, LLM Classification (on cache miss): GPT-5.4 Nano with a ReAct agent. 93% accuracy on novel items, at $0.03/batch.

Output: items classified into the 723-category taxonomy.
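The tier dispatch can be sketched in plain Python. This is illustrative only: the in-memory dict stands in for the PostgreSQL pg_trgm alias table, and all names (`Classification`, `llm_classify`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    category: str
    sub_category: str
    type_: str

# Tier 1: alias cache. In production this is a PostgreSQL table queried
# with pg_trgm similarity; an in-memory dict stands in here.
ALIAS_CACHE: dict[str, Classification] = {
    "amul toned milk pouch 500ml": Classification(
        "Dairy, Bread and Eggs", "Milk", "Toned Milk"
    ),
}

def classify(raw_text: str) -> Classification:
    key = raw_text.strip().lower()
    cached = ALIAS_CACHE.get(key)       # Tier 1: free, ~95% hit rate when warm
    if cached is not None:
        return cached
    result = llm_classify(raw_text)     # Tier 2: LLM call on cache miss
    ALIAS_CACHE[key] = result           # warm the cache for next time
    return result

def llm_classify(raw_text: str) -> Classification:
    # Placeholder for the Tier 2 GPT-5.4 Nano structured-output call.
    raise NotImplementedError
```

Because every Tier 2 result is written back to the cache, the system's per-item cost falls as the cache warms.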

Critical Design Decision

LLMs return text fields, never UUIDs. Our first iteration asked the LLM to return category UUIDs directly. It achieved 0% accuracy — the model hallucinated plausible-looking UUIDs that didn't exist in the database. Text-based resolution eliminated this entirely.

Iteration 1: The Naive Approach

72.2% accuracy

Our first working version was a single LLM call per batch of 50 items using LangChain's with_structured_output with the default method (function calling):

classifier.py

python

structured_llm = llm.with_structured_output(ClassificationResult)
result = await structured_llm.ainvoke([system_msg, user_msg])

Three problems destroyed accuracy:

1. Alignment Bug — Positional Result Matching

The benchmark runner assembled results as matched + resolved + unresolved, breaking positional alignment between input items and output results. Item 5's result landed in slot 3.

Fix: Match results to inputs by rawText key, not position. Build a dict {rawText: result} and look up each input item.

2. function_calling vs json_schema

LangChain's with_structured_output defaults to OpenAI's function calling mechanism. Switching to method="json_schema" — which uses OpenAI's native Structured Outputs API with constrained decoding — gave an immediate accuracy boost.

classifier.py

python

# One line change worth 19 percentage points
llm.with_structured_output(ClassificationResult, method="json_schema")

3. Taxonomy Ambiguity

Spices exist in both "Staples > Spices" (with Hindi names like "Haldi Powder") and "Masala, Pickle and Papad > Powdered Spices" (with English names like "Turmeric Powder"). The LLM had no guidance on which to choose.

Fix: Add a disambiguation rule: "Choose the category whose type name is the CLOSEST text match to the item."
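The same rule can be approximated mechanically with stdlib fuzzy matching. This is a sketch only: the production rule is expressed in the prompt, and the candidate list here is made up.

```python
import difflib

# Hypothetical candidate types from the two overlapping spice categories.
CANDIDATE_TYPES = ["Haldi Powder", "Turmeric Powder"]

def closest_type(item_text: str) -> str:
    # Score each candidate type name against the raw item text and keep
    # the closest match.
    scored = [
        (difflib.SequenceMatcher(None, item_text.lower(), t.lower()).ratio(), t)
        for t in CANDIDATE_TYPES
    ]
    return max(scored)[1]
```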

Impact: These three fixes took accuracy from 72.2% to 91% in one iteration. The json_schema switch alone was worth 19 percentage points — the single biggest improvement in the entire project.

Iteration 2: Indian Language Support

89-91% accuracy

19 of 24 Hindi items were already classified correctly. The 5 failures were all generic unbranded items: "दूध" (milk), "आटा" (flour), "चीनी" (sugar). The LLM knew these were dairy/cereal/sweetener items but defaulted to "Others" as the type — it couldn't confidently pick "Toned Milk" vs "Full Cream Milk" for a generic Hindi word.

The fix: default-type rules for unbranded items, reflecting Indian household purchasing patterns:

prompt_rules.txt

text

# Indian household defaults
milk / dudh / paal  →  "Toned Milk"       # Most common milk type
atta / flour        →  "Chakki Atta"       # Most common flour
sugar / cheeni      →  "Sulphur-Free Sugar" # Most common sugar

We also added normalized text matching — Unicode NFC, apostrophe variants, case folding — to handle the LLM occasionally modifying Hindi raw text in its output.
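A sketch of that normalization; the exact variant list in production may differ:

```python
import unicodedata

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # canonical composed Unicode form
    text = text.replace("\u2019", "'")         # curly -> straight apostrophe
    return text.casefold().strip()
```

NFC composition means a decomposed and a precomposed spelling of the same string compare equal after normalization, which is what rescues lookups when the LLM re-emits the input in a different Unicode form.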

Iteration 3: The Agent Approach

93% accuracy

At 89-91%, the remaining failures fell into patterns that a single LLM call couldn't fix:

Invalid taxonomy triples: the LLM generated (category, subCategory, type) combinations that didn't exist in the database.

Brand-category misrouting: MDH Garam Masala going to "Staples" instead of "Masala, Pickle and Papad".

"Others" over-selection: the LLM defaulting to "Others" when a specific type was available.

Key Insight

These are all verifiable errors. We can check if a triple exists, check if "Others" has alternatives, check brand-category associations. What we needed was a validation loop, not a better prompt.

The Classification Agent

We built a LangGraph ReAct agent that orchestrates five steps with a validation loop for invalid items:

1. lookup_items: check the alias cache for known items.

2. classify_batch: the LLM classifies novel items via json_schema.

3. validate_and_resolve: check triples against the taxonomy DB.

4. fix_invalid: the agent reasons about corrections.

5. resolve_items: map to database entities (variantIds).

Invalid items loop back through fix_invalid with taxonomy suggestions
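The control flow is easiest to see in plain Python. This is a sketch: the real system wires these steps as LangGraph nodes, and the taxonomy and suggestion data here are illustrative stand-ins for the database.

```python
# Illustrative taxonomy and suggestion data.
VALID_TRIPLES = {
    ("Dairy, Bread and Eggs", "Milk", "Toned Milk"),
    ("Masala, Pickle and Papad", "Powdered Spices", "Turmeric Powder"),
}
SUGGESTIONS = {
    # Close taxonomy matches offered back to the fixing agent.
    ("Staples", "Spices", "Turmeric Powder"):
        ("Masala, Pickle and Papad", "Powdered Spices", "Turmeric Powder"),
}

def validate(triples):
    valid = [t for t in triples if t in VALID_TRIPLES]
    invalid = [t for t in triples if t not in VALID_TRIPLES]
    return valid, invalid

def fix_invalid(triples):
    # Stands in for the agent LLM reasoning over the suggestions.
    return [SUGGESTIONS.get(t, t) for t in triples]

def classify_with_validation(proposed, max_rounds=2):
    valid, invalid = validate(proposed)
    for _ in range(max_rounds):           # invalid items loop back through fixing
        if not invalid:
            break
        newly_valid, invalid = validate(fix_invalid(invalid))
        valid += newly_valid
    return valid, invalid  # anything still invalid is reported, not guessed
```

Bounding the loop with `max_rounds` keeps a stubbornly invalid item from cycling forever; it falls out as unresolved instead.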

What validate_and_resolve Does

1. Triple validation: check each (category, subCategory, type) against the taxonomy database.

2. Brand routing: flag masala brands (MDH, Everest, MTR) misrouted to "Staples".

3. "Others" flagging: when type is "Others" but specific types exist, force a correction.

4. Auto-resolve valid items: items with valid triples go straight to database entity resolution.

5. Return invalid with suggestions: the agent gets the item plus close taxonomy matches to pick from.
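The brand-routing and "Others" checks can be sketched as follows; the brand list and taxonomy contents are examples, not the production data:

```python
MASALA_BRANDS = {"mdh", "everest", "mtr"}
TYPES_BY_SUBCAT = {
    ("Masala, Pickle and Papad", "Powdered Spices"):
        ["Turmeric Powder", "Red Chilli Powder", "Others"],
}

def flag_issues(raw_text: str, category: str, sub_category: str, type_: str):
    issues = []
    # Crude brand extraction for illustration: first token, possessive stripped.
    brand = raw_text.split()[0].lower().removesuffix("'s")
    if brand in MASALA_BRANDS and category == "Staples":
        issues.append("brand-category misrouting")
    specific = [t for t in TYPES_BY_SUBCAT.get((category, sub_category), [])
                if t != "Others"]
    if type_ == "Others" and specific:
        issues.append("'Others' chosen but specific types exist")
    return issues
```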

The agent LLM then reasons about each invalid item — reading the suggestions and picking the best correction based on item text and brand. This is where the agent approach genuinely outperforms the single-call approach: the fixing step requires contextual reasoning that a structured output call can't do.

Sub-Agent Integration

The classification agent is invoked internally from catalog_utils.classify_items() — the function signature stays identical. All callers benefit transparently:

catalog_utils.py

python

async def classify_items(client, items) -> dict[str, str | None]:
    set_client(client)  # Inject API client into agent tools
    agent = create_catalog_agent()
    result = await agent.ainvoke({
        "messages": [HumanMessage(content=json.dumps(items))]
    })
    # Extract variantIds from tool call results
    ...

This follows LangGraph's "invoke subgraph from a node function" pattern. Parent agents never know classification is handled by a sub-agent.

Benchmarking: The Backbone of Iteration

We built a comprehensive benchmarking system early — a 255-item synthetic dataset with ground-truth labels covering:

24 Hindi/Devanagari items across all categories

Branded and unbranded items

Overlapping taxonomy regions (Masala vs Staples)

Edge cases (items in multiple categories)
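The scoring itself is simple; a minimal sketch with illustrative field names, matching by identity rather than position:

```python
def category_accuracy(dataset: list[dict], predictions: dict) -> float:
    # dataset: list of {"rawText": ..., "expected": (cat, subcat, type)}
    # predictions: {rawText: (cat, subcat, type)}, keyed by rawText so a
    # dropped or reordered LLM result can never shift the score.
    correct = sum(
        1 for row in dataset
        if predictions.get(row["rawText"]) == row["expected"]
    )
    return correct / len(dataset)
```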

Run Benchmarks After Every Change

Our "Indian language support" iteration briefly regressed overall accuracy from 91% to 88% because the expanded prompt changed the LLM's attention on previously-correct items. We caught this immediately and tightened the prompt.

Results

| Iteration | Approach | Category Accuracy | Cost / 255 items |
|---|---|---|---|
| Naive | Single LLM call (function_calling) | 72.2% | $0.02 |
| Alignment + json_schema | Single LLM call (fixed) | 91.0% | $0.02 |
| Indian language support | Single LLM call + prompt rules | 88-91% | $0.02 |
| Classification agent | ReAct agent with validation | 93% | $0.03 |

The agent adds ~50% more cost (extra orchestration LLM calls) but catches and fixes errors that no single-call approach can detect.

6 Lessons from Building LLM Classification Systems

1. Match by identity, not position. Any system that batches items for LLM classification and assumes response order matches input order will have alignment bugs. Always match by a unique key.

2. json_schema > function_calling. OpenAI's Structured Outputs API uses constrained decoding and produces higher accuracy than function calling for classification. A one-line change worth 19 percentage points.

3. Validation is cheap, hallucination is expensive. Checking a taxonomy triple costs nothing. Letting an invalid triple through creates a wrong alias that poisons all future cache lookups.

4. Prompts are products, not instructions. Our system prompt has disambiguation rules, brand routing, language defaults, and few-shot examples. Each line was added to fix a specific benchmark failure.

5. Agents beat prompts for verifiable errors. When you can programmatically check whether output is valid, an agent loop outperforms prompt engineering: it just needs to catch errors and fix them.

6. Dataset labels can be wrong. Four items in our benchmark had wrong labels. The LLM was classifying correctly but our ground truth was wrong. Fix the dataset, not the model.

The Path to 95%+

The remaining ~7% of errors break down into addressable categories:

~2-3%: LLM non-determinism. Same input gives slightly different outputs across runs; temperature=0 doesn't fully eliminate this.

~2-3%: Genuine ambiguity. "Haldiram's Aloo Bhujia Pack 200g" vs "Haldiram's Aloo Bhujia Namkeen 350g" belong in different categories based on a single word.

~1-2%: Hindi rawText modification. The LLM occasionally transliterates Hindi input despite explicit instructions.

The path to 95%+ likely involves: model upgrade (GPT-5.4 vs Nano), confidence-based retry for uncertain items, and expanding the alias cache to cover more items in Tier 1.


Summary

This system processes real grocery orders daily, classifying items across 12+ Indian languages into a 723-category taxonomy at $0.03 per batch. The agent-based approach turned a brittle single-call pipeline into a self-correcting system that improves with every classification. The three highest-impact changes were: matching results by identity instead of position, switching from function_calling to json_schema structured outputs (+19 percentage points), and adding a ReAct validation agent that catches verifiable errors the LLM makes on its first pass.

Need Help Building LLM-Powered Classification?

We build production-grade AI systems — multi-agent orchestration, multilingual NLP, and LLM cost optimization. 4-6 weeks to first deliverable.

Tags: LLM Classification, Multi-Agent Systems, LangGraph, LangChain, Structured Outputs, Multilingual NLP, Indian Languages, Entity Resolution, Product Taxonomy, GPT, Python, Cost Optimization, ReAct Agent

Written by: FRE|Nxt Labs

Published: April 2026 | Last updated: April 2026

*Based on production implementation for a household management platform client.