Re-Architecting DungBeetle on Apache Iceberg

Re-architecting DungBeetle for 2026 keeps its core idea (compute the heavy SELECT once, then slice the cached result cheaply) and replaces the dated substrate: move the result store to Apache Iceberg on object storage, route each query to the engine that fits it (DuckDB, StarRocks, Trino, RisingWave), and put a durable workflow engine underneath instead of Redis plus a queue.

Last updated: May 17, 2026.

TL;DR. DungBeetle's core idea (compute the heavy SELECT once, then let users slice the cached result without re-hitting the primary database) is correct and aged well. What aged badly is the assumption that the cache is "another SQL database with one table per job." Move the result store to Apache Iceberg on object storage, route each query to the engine that fits it (DuckDB for small, StarRocks for heavy MPP, Trino for federation, RisingWave for streaming), and put a durable workflow engine underneath. You keep the design intent and drop every 2018-era constraint.

What DungBeetle Got Right, and What Aged Badly

DungBeetle is a job server for heavy SQL reports. The pattern is simple and good: drop a long-running aggregation onto something that can spend a few minutes on it, cache the result, then make the follow-up filter, sort, and paginate cheap. The expensive part runs once. The slicing runs against the cache, not the stressed primary database.

That separation is the load-bearing idea, and it is worth being explicit about the parts of the original design that should survive any rewrite:

Compute is separated from slicing. Heavy aggregation is paid once. Follow-up reads are cheap.
SQL files are the unit of work. An analyst can code-review a task. It version-controls cleanly. No YAML graph required.
Declarative per-task metadata. Comment-frontmatter (-- db:, -- queue:, -- results:) attaches routing without ceremony.
Multi-queue priority lanes plus worker pools. A clean horizontal-scale story.
Poll-by-job-id. Fine, as long as you layer push on top rather than ripping it out.

Keep all of that. The thing that dates the design is everything underneath it. One SQL table per job causes catalog bloat, makes every job a DDL event, forces lossy coercion into a six-type vocabulary, gives you no real TTL, and leaves cancellation half-working because you cannot reliably kill a MySQL query. In 2026 the natural home for that cache is an open table format on object storage, and the natural way to read it is whichever engine is fastest for the shape of the query you are running.

The New Architecture at a Glance

flowchart TD C["Client"] GW["API Gateway
OIDC auth, per-tenant quotas, rate limits"] ORCH["Orchestrator: River or Temporal
Durable workflows, idempotent enqueue,
retries, DAG groups, real cancellation"] RT["Query Planner / Engine Router
EXPLAIN, estimate bytes, joins,
federation, score, pick"] DUCK["DuckDB
small"] STAR["StarRocks
MPP"] TRINO["Trino
federate"] RW["RisingWave
streaming"] ICE["Apache Iceberg
REST catalog: Lakekeeper / Polaris
S3 / GCS / MinIO, per-tenant namespace
Retention: partition delete + expire_snapshots"] C -->|"HTTP/gRPC + SSE + webhook"| GW GW --> ORCH ORCH --> RT RT --> DUCK RT --> STAR RT --> TRINO RT --> RW DUCK --> ICE STAR --> ICE TRINO --> ICE RW --> ICE

The new architecture: client to an authenticated gateway, a durable orchestrator, a per-query engine router, and Apache Iceberg as the single result store.

Eight layers. Each one removes a specific constraint the original could not solve. Walking them in order:

Layer 1: The Result Store Is Just Apache Iceberg

Start here, because it collapses the largest class of problems. There is no separate "results database" anymore. The Iceberg table is the result.

Why Iceberg, not Delta or Hudi. All three are credible. For a reporting workload built around schema diversity and many small concurrent writers, Iceberg's hidden partitioning, partition evolution without rewrites, branching, and time travel are the deciding features. The catalog ecosystem matured fast across 2024 and 2025: Apache Polaris and Lakekeeper are production-ready open-source REST catalogs, and Nessie adds Git-style versioning (though its own docs still mark the Iceberg REST endpoint experimental, so treat it as the versioning option, not the default REST catalog). Either way, you are no longer locked to AWS Glue or a vendor.

Table layout. Instead of one table per job, define one Iceberg table per report family (payments.daily_pnl, risk.intraday_exposures), partitioned by (tenant_id, job_id, run_date). Each job appends a snapshot. Reads filter on job_id and hit a single partition. Thousands of results_<uuid> tables collapse into a handful of curated tables with real schemas, comments, owners, and lineage. Small-file pressure is handled by Iceberg's rewrite_data_files compaction running asynchronously.

What this enables, once you run the right maintenance and respect engine compatibility:

Retention becomes a metadata delete, not a DROP-table cron. Be precise about this: expire_snapshots only reclaims data that no retained snapshot still references. It is not a row-level TTL on append-only results. The actual pattern is a partition-scoped delete by run_date (a metadata-only operation in Iceberg), followed by expire_snapshots plus orphan-file cleanup to reclaim the storage. That replaces the README's struck-through ttl parameter, but it is a maintenance job you own, not a free one-liner.
Type fidelity. Nested structs, lists, maps, fixed-precision decimals, UUIDs, and binary are all native, so an array of structs is a first-class citizen instead of a coercion casualty. True JSON-shaped payloads use the v3 variant type, where engine support still lags (Spark 4.0 is furthest along, Trino and others are catching up), so verify your engines before relying on it.
Time travel. "What did this report look like at last Tuesday's snapshot" is a built-in query.
Branching and write-audit-publish. A new query can be validated on a branch before the prod snapshot moves. Note this is a Spark/Flink-side maintenance operation today: the read engines here (DuckDB, StarRocks, Trino) consume the resolved snapshot, they do not all create or merge branches, so treat branching as a pipeline capability, not a per-query one.
Portability. The same result is queryable from Spark, Flink, Athena, Snowflake's Iceberg reader, anything that speaks the Iceberg REST catalog.

Layer 2: The Query Engine Router Is the Actual Idea

This is the decision DungBeetle never made. It assumes one engine per source database, full stop. In a lakehouse, the engine is a choice, and the right choice depends on the query.

A heuristic-driven router beats any single-engine deployment because the cost of running the wrong engine is asymmetric. A 30-second Trino query that DuckDB would have finished in 200ms is fine for the user but wastes cluster capacity. A query DuckDB OOMs on, but StarRocks would have streamed through, is a customer-facing failure. You want to be wrong in the cheap direction, and the only way to do that reliably is to pick per query.

Engine	Use it when	Why
DuckDB (embedded in the worker)	Estimated scan under a few GB. Single-table point lookups, small aggregations, the slice-and-dice follow-up reads.	Zero network hop, vectorised, talks Iceberg natively via `iceberg_scan`. This is the workhorse the secondary results-DB was originally invented for.
StarRocks (or ClickHouse)	Heavy MPP aggregation across one big fact table and a few dimensions. "Scans 100GB and does seven joins."	Vectorised, pipeline parallelism, materialised views if you want them.
Trino	One statement that federates across Iceberg + Postgres + Kafka + MongoDB.	The connector ecosystem is the most mature. Nothing else comes close for true federation.
RisingWave (or Materialize)	The small subset of reports that should be a continuously maintained materialised view over a stream.	These never enter the job queue. They are always fresh.
DataFusion / Polars	Rust or Python shops that want the embedded-engine property without DuckDB.	Same in-process, zero-hop characteristic.

How the router decides. Do not assume a generic EXPLAIN gives reliable estimates everywhere: Trino prints ? or NaN without table statistics, StarRocks needs EXPLAIN COSTS, and DuckDB has its own Iceberg planning path. The robust signal is a blend: Iceberg manifest metadata for partition pruning and bytes-per-partition, per-engine planner estimates where stats exist, and observed runtime histograms for everything else. From that blend, score it: federation needed goes to Trino, very large scans go to StarRocks, everything else goes to DuckDB. An -- engine_hint: directive lets an analyst override. Track engine-per-template runtimes in a Prometheus histogram and let the router learn: if the historical p95 for "report X on DuckDB" creeps above 10s, auto-promote it to StarRocks.

The pluggable engine adapter interface is small on purpose:

type Engine interface {
    Name() string
    Score(plan Plan) float64           // 0..1, how good a fit
    Execute(ctx, sql, args, sink) (QueryID, error)
    Cancel(ctx, QueryID) error
    Stats(ctx, QueryID) (RuntimeStats, error)
}

Adding a new engine is a Go file. The orchestrator never knows or cares which engine ran the query. It gets a QueryID it can use for cancellation, which is what finally makes the cancel-job API work. The MySQL gap from the original is gone, because there is no MySQL.

Layer 3: Orchestration Is a Durable Workflow, Not Redis + a Queue

tasqueue plus Redis served the original well. For a 2026 platform the right primitive is a durable workflow engine. Two reasonable picks:

River, if you already run Postgres. Postgres-backed, Go-native, simple operational model. Exactly-once enqueue via unique-key constraints, retries with jitter, periodic jobs.
Temporal, if you have multi-step DAG-shaped reports with branching, compensation, or human-in-the-loop steps. Heavier ops cost, but it pays for itself the moment you need "run A, then B and C in parallel, then D if both succeeded, otherwise alert."

The wins over Redis plus tasqueue are concrete, with one caveat stated up front: durable workflows give you exactly-once enqueue (transactional insert plus unique-key dedup in River, durable workflow state in Temporal), but execution is still at-least-once, so you own idempotency on the result write regardless of engine. Past that: idempotency keys on the result write so a retry does not double-append to Iceberg, durable state that survives worker crashes, and cancellation that propagates to a real engine query-id instead of hopefully-someday-killing a MySQL query. The simple fan-out group DungBeetle already supported is a few lines on either engine. Anything more complex, you are no longer stuck.

Layer 4: Keep the SQL Files, Extend the Frontmatter

The task model stays. SQL files with comment-frontmatter are a good format and there is no reason to invent a new one. Extend the directives to cover the new capabilities:

-- name: daily_pnl
-- engine_hint: starrocks
-- cost_budget_bytes: 50_000_000_000
-- max_runtime: 10m
-- retention: 30d
-- result_table: payments.daily_pnl
-- partition_by: tenant_id, run_date
-- emit_format: parquet,arrow
SELECT
  tenant_id,
  date_trunc('day', ts) AS run_date,
  SUM(amount)           AS pnl,
  COUNT(*)              AS trades
FROM raw_trades
WHERE ts >= $1 AND ts < $2
GROUP BY 1, 2;

At load time every task is validated. EXPLAIN runs against the chosen engine, partition pruning is verified, and the resulting Iceberg schema is diffed against the declared result_table. Additive drift (a new nullable column) is applied automatically, because Iceberg handles that natively. Breaking drift fails the task at load and blocks the deploy. That single property is worth the rewrite on its own: schema regressions stop being a 2am incident and become a failed CI check.

Layer 5: The Client API Is Push, Not Poll

Polling every second for job status is fine internally and archaic from a client's perspective. Layer push on top, without removing the poll:

gRPC with a REST gateway for the control plane. Connect-Go is excellent for this.
Server-Sent Events for status streams. Cheap, browser-friendly, no WebSocket complexity.
Webhook callback URL per job, signed with HMAC.
Signed pre-fetch URLs for direct result download from object storage in Parquet, Arrow IPC, JSON, or CSV. No proxying multi-GB result sets through the API.
A stateless result-query endpoint that accepts a follow-up SQL filter against a finished job and routes it through DuckDB. This is the direct successor to DungBeetle's "slice the results DB cheaply" property, except the result is a Parquet file on S3 and the slicer is an embedded vectorised engine.

The polling API stays available for clients that prefer it. Nothing forces an upgrade.

Layer 6: Multi-Tenancy, Auth, and Cost Control

DungBeetle assumes a trusted private-network caller. For a modern platform that is a non-starter.

OIDC at the gateway, tenant ID injected into every job and carried through the orchestrator.
Per-tenant Iceberg namespaces with bucket-prefix IAM policies, so cross-tenant reads are physically impossible, not just enforced in code.
Engine resource groups. Trino resource groups and StarRocks resource groups give per-tenant CPU and memory limits inside one cluster. DuckDB runs in a worker container with cgroup limits.
Cost budgets enforced at plan time using EXPLAIN estimates against the historical p95 for that template. A query that would blow the tenant's daily byte budget is rejected with a meaningful error before it runs, not after it has burned the cluster.
Per-tenant queue priority lanes, exactly the way DungBeetle did it. That part needs no change.

Layer 7: Observability and Lineage Are First-Class

DungBeetle has none of this. You get all three by choosing modern building blocks:

OpenTelemetry traces spanning enqueue, plan, route, execute, write, notify. Every span carries tenant_id, template_id, engine, job_id. You can finally answer "where did the 30 seconds go?"
Prometheus metrics keyed by template, engine, and tenant. Histograms for runtime, gauges for active workers, counters for retries and cancellations. This is also what feeds the router's learned promotion.
OpenLineage events emitted on every Iceberg write. Drop them into Marquez or DataHub for free dataset-level lineage from source columns through to the consumer report.

Layer 8: The Migration Path Is Reversible

A clean migration in four phases, each independently shippable and individually reversible. No big-bang cutover.

Dual-write. Stand up Iceberg, Lakekeeper, and one engine (Trino is the safest default). Modify the existing DungBeetle worker to append to Iceberg in addition to the SQL results DB. Validate row counts and checksums for a couple of weeks.
Cut readers over. Point report consumers at Iceberg. Retire the results DB once nothing reads it.
Replace the orchestrator. Swap tasqueue for River or Temporal. Task SQL files stay byte-identical.
Introduce the engine router. Start with engine_hint: directives on the heaviest queries, then turn on automatic routing once you trust the planner's scoring.

When NOT to Do This

This design is genuinely worse for some workloads. Honest trade-offs matter more than the pitch.

Latency floor. Object storage first-touch is 50 to 200ms even with warm DuckDB caches. If your reports are tiny and your QPS is high, stashing results in a small MySQL is faster end-to-end.
Operational surface. You are now running a catalog service, object storage, an orchestrator, and one or more query engines. For a three-person team that is too much. The original DungBeetle is a single binary plus Redis, and that is a feature.
Cluster cost. Trino and StarRocks clusters are not free. If your workload is bursty and small, run DuckDB-in-worker only and skip the MPP engines until you have a query they cannot handle.
OLTP shape. Parquet cold scans on object storage are 5 to 10x slower than a warm RDBMS index hit for a single-row lookup. Lakehouses are for analytics. If a query reads one row by primary key, route it to the source DB and do not pretend otherwise.

If your problem is Zerodha-shaped (heavy RDBMS reports against a stressed primary, internal trusted callers, a single binary you can deploy in an afternoon) the original DungBeetle is still the right answer. Do not rebuild because rebuilding is fun.

Stack Recommendation

A defensible, all-open-source starting point:

Storage: S3 (or MinIO) + Apache Iceberg + Lakekeeper REST catalog
Engines: DuckDB embedded, StarRocks for MPP, Trino for federation. Add RisingWave only when you have a streaming materialised-view use case.
Orchestrator: River on Postgres for most teams. Temporal if you have real workflow needs.
API: Go service using Connect-Go, with SSE and webhook delivery.
Auth and policy: OIDC at the gateway, OPA for per-tenant policy decisions.
Observability: OpenTelemetry to Tempo or Jaeger, Prometheus to Grafana, OpenLineage to Marquez.

More moving parts than DungBeetle's two. Every one of them earns its keep by removing a real constraint the original could not solve.

The Bottom Line

DungBeetle had the right idea five years too early. "Compute heavy, slice cheap, keep them on separate infrastructure" is correct. What was wrong is that both halves lived on RDBMS. Move the heavy compute onto a router that picks the right engine per query, move the slice-cheap layer onto Iceberg plus embedded DuckDB reads, and put a durable orchestrator underneath. You keep the design intent, drop every limitation that aged badly, and end up with something that fits how data platforms are actually built in 2026. If you are weighing a rebuild like this, our team is happy to pressure-test the trade-offs with you: start a conversation.

FAQ

What problem does DungBeetle solve?

It runs heavy SQL reports as background jobs, caches the result, and lets users filter, sort, and paginate the cached result cheaply without re-running the expensive query against the primary database. The compute is paid once. The slicing is cheap.

Why replace the results database with Apache Iceberg?

The "one SQL table per job" pattern causes catalog bloat, makes every job a DDL event, forces lossy type coercion, and has no real TTL. Iceberg replaces all of that: one table per report family, retention via partition-scoped deletes plus snapshot expiration (a maintenance job you run, not a free TTL), native nested types, time travel, branching as a pipeline capability, and horizontal read scaling off object storage.

How does the per-query engine router decide which engine to use?

At task-load time it runs EXPLAIN to estimate bytes scanned, partition pruning, join count, and whether non-Iceberg sources are referenced. Federation goes to Trino, very large scans go to StarRocks, everything else goes to DuckDB. Analysts can override with an engine_hint directive, and the router learns from observed p95 runtimes.

Should every team rebuild DungBeetle this way?

No. If reports are tiny with high QPS, if the team is small, if the workload is bursty, or if queries are OLTP single-row lookups, the original single-binary DungBeetle is still the better choice. This rebuild pays off for multi-tenant analytics platforms with diverse, heavy, federated reporting workloads.

Is the migration risky?

It is designed to be reversible. Dual-write to Iceberg alongside the existing results DB, validate checksums, cut readers over, then replace the orchestrator, then enable routing. Each phase is independently shippable and individually reversible, with no big-bang cutover.

Re-Architecting DungBeetle on Apache Iceberg, With a Per-Query Engine Router