Multilingual Entity Resolution for Indian Grocery Items Across 12+ Languages
How we evolved a naive LLM call into a self-correcting multi-agent classification system that processes real grocery orders daily at $0.03 per batch of 255 items.
12 min read
April 2026
Household Management Platform
93%
Category Accuracy
$0.03
Cost per Batch
12+
Languages Supported
723
Taxonomy Categories
LLM-based grocery classification is the process of using large language models to categorize raw grocery text into a structured product taxonomy. Unlike traditional text classification, it requires multilingual entity resolution — mapping items written in Hindi, Tamil, English, Hinglish, or any of India's 20+ major languages to specific database entities with variant-level specificity.
Our client operates a household management platform serving Indian families. Users add grocery items to shopping carts via WhatsApp messages — and the items arrive as raw text:
"Amul Toned Milk Pouch 500ml"English, branded
"दूध 1L"Hindi (Devanagari)
"aata 10kg whole wheat"Hinglish, generic
"Haldiram's Aloo Bhujia Namkeen 350g"English, branded snack
Each item needs classification into a 723-entry product taxonomy of Category > SubCategory > Type — from "Dairy, Bread and Eggs > Milk > Toned Milk" to "Masala, Pickle and Papad > Powdered Spices > Turmeric Powder" — then resolved to a database entity with brand, quantity, and unit specificity.
We designed a two-tier system that optimizes for both cost and accuracy. The alias cache handles the 95%+ of items that have been seen before, while the LLM agent classifies genuinely novel items.
Raw grocery text in 12+ languages
Tier 1
Alias Cache
PostgreSQL pg_trgm fuzzy matching
95%+ hit rate on warm cache
Free; cache misses fall through to Tier 2
Tier 2
LLM Classification
GPT-5.4 Nano with ReAct agent
93% accuracy on novel items
$0.03/batch
Classified into the 723-category taxonomy
Critical Design Decision
LLMs return text fields, never UUIDs. Our first iteration asked the LLM to return category UUIDs directly. It achieved 0% accuracy — the model hallucinated plausible-looking UUIDs that didn't exist in the database. Text-based resolution eliminated this entirely.
Our first working version was a single LLM call per batch of 50 items using LangChain's with_structured_output with the default method (function calling):
classifier.py
python
structured_llm = llm.with_structured_output(ClassificationResult)
result = await structured_llm.ainvoke([system_msg, user_msg])
Three problems destroyed accuracy:
1
Alignment Bug — Positional Result Matching
The benchmark runner assembled results as matched + resolved + unresolved, breaking positional alignment between input items and output results. Item 5's result landed in slot 3.
Fix
Match results to inputs by rawText key, not position. Build a dict {rawText: result} and look up each input item.
2
function_calling vs json_schema
LangChain's with_structured_output defaults to OpenAI's function calling mechanism. Switching to method="json_schema" — which uses OpenAI's native Structured Outputs API with constrained decoding — gave an immediate accuracy boost.
classifier.py
python
# One-line change worth 19 percentage points
llm.with_structured_output(ClassificationResult, method="json_schema")
3
Taxonomy Ambiguity
Spices exist in both "Staples > Spices" (with Hindi names like "Haldi Powder") and "Masala, Pickle and Papad > Powdered Spices" (with English names like "Turmeric Powder"). The LLM had no guidance on which to choose.
Fix
Add a disambiguation rule: "Choose the category whose type name is the CLOSEST text match to the item."
Impact
These three fixes took accuracy from 72% to 91% in one iteration. The json_schema switch alone was worth 19 percentage points — the single biggest improvement in the entire project.
19 of 24 Hindi items were already classified correctly. The 5 failures were all generic unbranded items: "दूध" (milk), "आटा" (flour), "चीनी" (sugar). The LLM knew these were dairy/cereal/sweetener items but defaulted to "Others" as the type — it couldn't confidently pick "Toned Milk" vs "Full Cream Milk" for a generic Hindi word.
We added default-type rules for unbranded items, reflecting Indian household purchasing patterns:
prompt_rules.txt
text
# Indian household defaults
milk / dudh / paal → "Toned Milk" # Most common milk type
atta / flour → "Chakki Atta" # Most common flour
sugar / cheeni → "Sulphur-Free Sugar" # Most common sugar
We also added normalized text matching — Unicode NFC, apostrophe variants, case folding — to handle the LLM occasionally modifying Hindi raw text in its output.
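The normalization step can be sketched as follows (the exact set of character substitutions in production may differ):

```python
import unicodedata

def normalize(text: str) -> str:
    # Canonical Unicode form so visually identical strings compare equal
    text = unicodedata.normalize("NFC", text)
    # Fold curly apostrophe variants the LLM sometimes substitutes
    text = text.replace("\u2019", "'").replace("\u02bc", "'")
    return text.casefold().strip()
```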
At 89-91%, the remaining failures fell into patterns that a single LLM call couldn't fix:
Invalid taxonomy triples
LLM generated (category, subCategory, type) combinations that didn't exist in the database
Brand-category misrouting
MDH Garam Masala going to "Staples" instead of "Masala, Pickle and Papad"
"Others" over-selection
LLM defaulting to "Others" when a specific type was available
Key Insight
These are all verifiable errors. We can check if a triple exists, check if "Others" has alternatives, check brand-category associations. What we needed was a validation loop, not a better prompt.
We built a LangGraph ReAct agent that orchestrates five steps with a validation loop for invalid items:
lookup_itemsCheck alias cache for known items
1/5
classify_batchLLM classifies novel items via json_schema
2/5
validate_and_resolveCheck triples against taxonomy DB
3/5
fix_invalidAgent reasons about corrections
4/5
resolve_itemsMap to database entities (variantIds)
5/5
Invalid items loop back through fix_invalid with taxonomy suggestions
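The five steps and the loop can be summarized in plain Python. This is a minimal control-flow sketch, not the production code: the real system wires these steps as LangGraph tool nodes, and the round limit here is an assumed safeguard.

```python
MAX_FIX_ROUNDS = 3  # assumption: bound the validation loop

def classify(items, cache, llm_classify, llm_fix, validate, resolve):
    # 1. lookup_items: serve cache hits, keep misses for the LLM
    hits = {i: cache[i] for i in items if i in cache}
    pending = [i for i in items if i not in cache]
    # 2. classify_batch: LLM proposes (category, subCategory, type) triples
    triples = llm_classify(pending)
    for _ in range(MAX_FIX_ROUNDS):
        # 3. validate_and_resolve: split valid from invalid triples
        invalid = {i: t for i, t in triples.items() if not validate(t)}
        if not invalid:
            break
        # 4. fix_invalid: agent re-reasons with taxonomy suggestions
        triples.update(llm_fix(invalid))
    # 5. resolve_items: map valid triples to database variantIds
    resolved = {i: resolve(t) for i, t in triples.items() if validate(t)}
    return {**hits, **resolved}
```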
1
Triple validation
Check each (category, subCategory, type) against the taxonomy database
2
Brand routing
Flag masala brands (MDH, Everest, MTR) misrouted to "Staples"
3
"Others" flagging
When type is "Others" but specific types exist, force a correction
4
Auto-resolve valid items
Items with valid triples go straight to database entity resolution
5
Return invalid with suggestions
Agent gets the item plus close taxonomy matches to pick from
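All three checks are cheap set lookups. A sketch (the brand list, helper names, and example data are illustrative, not the production tables):

```python
MASALA_BRANDS = {"mdh", "everest", "mtr"}  # assumption: example brand list

def validate_item(raw_text, triple, valid_triples, types_by_subcat):
    """Return a list of verifiable problems; an empty list means auto-resolve."""
    category, sub_category, type_ = triple
    problems = []
    # 1. Triple validation: the combination must exist in the taxonomy
    if triple not in valid_triples:
        problems.append("invalid_triple")
    # 2. Brand routing: masala brands must not land in Staples
    if category == "Staples" and any(b in raw_text.lower() for b in MASALA_BRANDS):
        problems.append("brand_misrouted")
    # 3. "Others" flagging: reject the catch-all when specific types exist
    specific = types_by_subcat.get((category, sub_category), set()) - {"Others"}
    if type_ == "Others" and specific:
        problems.append("others_overused")
    return problems
```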
The agent LLM then reasons about each invalid item — reading the suggestions and picking the best correction based on item text and brand. This is where the agent approach genuinely outperforms the single-call approach: the fixing step requires contextual reasoning that a structured output call can't do.
The classification agent is invoked internally from catalog_utils.classify_items() — the function signature stays identical. All callers benefit transparently:
catalog_utils.py
python
async def classify_items(client, items) -> dict[str, str | None]:
    set_client(client)  # Inject API client into agent tools
    agent = create_catalog_agent()
    result = await agent.ainvoke({
        "messages": [HumanMessage(content=json.dumps(items))]
    })
    # Extract variantIds from tool call results
    ...
This follows LangGraph's "invoke subgraph from a node function" pattern. Parent agents never know classification is handled by a sub-agent.
We built a comprehensive benchmarking system early — a 255-item synthetic dataset with ground-truth labels covering:
24 Hindi/Devanagari items across all categories
Branded and unbranded items
Overlapping taxonomy regions (Masala vs Staples)
Edge cases (items in multiple categories)
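Scoring is a straight comparison of predicted triples against ground-truth labels, assuming "category accuracy" means an exact (category, subCategory, type) match (the function name and data shapes here are ours for illustration):

```python
def category_accuracy(predictions: dict[str, tuple],
                      ground_truth: dict[str, tuple]) -> float:
    # Fraction of labeled items whose predicted triple matches exactly;
    # a missing prediction counts as wrong.
    correct = sum(1 for key, label in ground_truth.items()
                  if predictions.get(key) == label)
    return correct / len(ground_truth)
```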
Run Benchmarks After Every Change
Our "Indian language support" iteration briefly regressed overall accuracy from 91% to 88% because the expanded prompt changed the LLM's attention on previously-correct items. We caught this immediately and tightened the prompt.
| Iteration | Approach | Category Accuracy | Cost / 255 items |
|---|---|---|---|
| Naive | Single LLM call (function_calling) | 72.2% | $0.02 |
| Alignment + json_schema | Single LLM call (fixed) | 91.0% | $0.02 |
| Indian language support | Single LLM call + prompt rules | 88-91% | $0.02 |
| Classification agent | ReAct agent with validation | 93% | $0.03 |
The agent adds ~50% more cost (extra orchestration LLM calls) but catches and fixes errors that no single-call approach can detect.
Match by Identity, Not Position
Any system that batches items for LLM classification and assumes response order matches input order will have alignment bugs. Always match by a unique key.
json_schema > function_calling
OpenAI's Structured Outputs API uses constrained decoding and produces higher accuracy than function calling for classification. A one-line change worth 19 percentage points.
Validation is Cheap, Hallucination is Expensive
Checking a taxonomy triple costs nothing. Letting an invalid triple through creates a wrong alias that poisons all future cache lookups.
Prompts are Products, Not Instructions
Our system prompt has disambiguation rules, brand routing, language defaults, and few-shot examples. Each line was added to fix a specific benchmark failure.
Agents Beat Prompts for Verifiable Errors
When you can programmatically check if output is valid, an agent loop outperforms prompt engineering. It just needs to catch errors and fix them.
Dataset Labels Can Be Wrong
Four items in our benchmark had wrong labels. The LLM was classifying correctly but our ground truth was wrong. Fix the dataset, not the model.
The remaining ~7% of errors break down into addressable categories:
LLM non-determinism
Same input gives slightly different outputs across runs. Temperature=0 doesn't fully eliminate this.
Genuine ambiguity
"Haldiram's Aloo Bhujia Pack 200g" vs "Haldiram's Aloo Bhujia Namkeen 350g" belong in different categories based on a single word.
Hindi rawText modification
The LLM occasionally transliterates Hindi input despite explicit instructions.
The path to 95%+ likely involves: model upgrade (GPT-5.4 vs Nano), confidence-based retry for uncertain items, and expanding the alias cache to cover more items in Tier 1.
Building an AI Classification System?
We specialize in production-grade LLM systems — from multilingual entity resolution to multi-agent orchestration. Our research-first methodology means we validate before we build.
Talk to our team
This system processes real grocery orders daily, classifying items across 12+ Indian languages into a 723-category taxonomy at $0.03 per batch. The agent-based approach turned a brittle single-call pipeline into a self-correcting system that improves with every classification. The three highest-impact changes were: matching results by identity instead of position, switching from function_calling to json_schema structured outputs (+19 percentage points), and adding a ReAct validation agent that catches verifiable errors the LLM makes on its first pass.
We build production-grade AI systems — multi-agent orchestration, multilingual NLP, and LLM cost optimization. 4-6 weeks to first deliverable.
Written by: FRE|Nxt Labs
Published: April 2026 | Last updated: April 2026
*Based on production implementation for a household management platform client.