Should you be using DuckLake?

By Abdulla Alhammadi • Jun 14, 2025


DuckLake is new and shiny. It stores your lakehouse metadata in a proper SQL database while your Parquet files stay where they are. That means fast planning time, real ACID guarantees, and easy schema edits. The flip side is that you must care for that catalog or you lose your map. Ready? Let's dive in.


The lakehouse metadata headache

A modern data lake often holds thousands of Parquet files. To make those files behave like a single table we layer on formats such as Iceberg or Delta. They write logs in JSON or Avro that explain every snapshot and every file. Over months these logs grow. Each query starts with a scan of many tiny metadata files. Scan time climbs and so does latency.

DuckLake tackles the problem from a different angle. Instead of writing logs to object storage it writes rows into ordinary SQL tables. The query planner can now grab one indexed row rather than walk through a pile of manifests. Less I/O means faster starts on a laptop and less metadata churn in the cloud.


How DuckLake works: SQL catalog + Parquet

Getting started is short enough to post here.

-- run inside DuckDB (the catalog itself can live in DuckDB or Postgres)
INSTALL ducklake;

ATTACH 'ducklake:metadata.duckdb' AS mylake (DATA_PATH 'parquet_folder/');
USE mylake;  -- make the lake the default catalog for the statements below

From that moment you use normal SQL.

CREATE TABLE sales (id INT, amount DOUBLE);
INSERT INTO sales VALUES (1, 99.0);

-- travel back to an earlier version
SELECT * FROM sales AT (VERSION => 1);  -- e.g. the snapshot right after CREATE TABLE

Under the hood there are three key tables:

  • transactions keeps one row per commit
  • snapshots points to the version you ask for
  • partitions maps every Parquet file to its rows

A write becomes a single transaction handled by the database engine. The extension copies or rewrites Parquet files if needed and updates the metadata rows. No manifest logs are created, so you avoid the small-file problem entirely.
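Because the catalog is just another database, you can open it and poke around. Here is a minimal sketch, run from a separate DuckDB session; the table names inside are DuckLake internals, so we list them rather than hard-code them:

-- open the metadata store read-only and enumerate its tables
ATTACH 'metadata.duckdb' AS meta (READ_ONLY);
SELECT table_name
FROM information_schema.tables
WHERE table_catalog = 'meta';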


Core features told in plain words

ACID and concurrency: The catalog engine locks rows just like any other database. Two analysts can insert data at the same time without stepping on one another.
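Here is a rough sketch of what that buys you, reusing the sales table from the example above (the extra rows are made up for illustration). Everything inside the transaction lands as a single commit, or not at all.

BEGIN TRANSACTION;
INSERT INTO sales VALUES (2, 45.0);
INSERT INTO sales VALUES (3, 12.5);
COMMIT;  -- both rows become visible together as one new snapshot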

Snapshot time travel: Every commit is a snapshot. Add AT (VERSION => N) to your query and you read the lake as it looked back then. Great for audits or “what just broke” moments.
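For the “what just broke” case you can even diff two snapshots in a single query. A small sketch; the version numbers are illustrative:

-- rows present in version 3 but not in version 2
SELECT * FROM sales AT (VERSION => 3)
EXCEPT
SELECT * FROM sales AT (VERSION => 2);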

Schema evolution: Need a new column? Run ALTER TABLE. DuckLake records the change in the catalog and carries on. No log rewrite marathon.
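For example (the column name is purely illustrative):

ALTER TABLE sales ADD COLUMN region VARCHAR;  -- one catalog change, existing Parquet files stay untouched
INSERT INTO sales VALUES (4, 7.5, 'EU');      -- new rows carry the extra column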

Fast metadata lookups: Instead of scanning logs, the engine reaches for an index. A cold start that once took seconds can drop to milliseconds.

Compatibility: Data files stay as plain Parquet, so Spark, pandas, or any tool that reads Parquet can still touch the raw files. They simply lose the extra features unless they understand DuckLake. Read more about how Parquet became a standard data format.
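You can check this with plain DuckDB and no DuckLake extension at all, using the folder path from the ATTACH example above:

-- no catalog, no time travel: just raw Parquet, readable by any Parquet tool
SELECT * FROM read_parquet('parquet_folder/**/*.parquet');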


Performance story and trade-offs

Benchmarks on a local laptop show that table discovery is often ten times faster than with manifest-based formats when the table has many snapshots. The metadata footprint also shrinks, because the catalog stores rows rather than files.

This speed comes at a cost. The catalog is now a single point of truth. If you skip backups and the disk fails, you will still have the Parquet files but no clue which belongs to which version. Treat the catalog like any other critical database, with regular dumps or replication.
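If the catalog is the local DuckDB file from the earlier example, one low-tech sketch is to copy it into a backup file from a separate session (a Postgres-backed catalog would rely on pg_dump or replication instead):

ATTACH 'metadata.duckdb' AS meta (READ_ONLY);
ATTACH 'catalog_backup.duckdb' AS backup;
COPY FROM DATABASE meta TO backup;  -- copies schema and metadata rows into the backup file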


DuckLake in context

Feature            | DuckLake               | Iceberg             | Delta               | Plain Parquet
Metadata home      | SQL tables             | JSON or Avro logs   | JSON logs           | None
ACID writes        | Yes                    | Yes                 | Yes                 | No
Time travel        | Yes, AT (VERSION => N) | Yes                 | Yes                 | No
Add or drop column | ALTER TABLE            | Supported           | Supported           | No
Small update cost  | Low                    | Medium              | Medium              | N/A
Multi-writer story | Handled by DB locks    | Extra service helps | Extra service helps | None

Good matches and poor matches

DuckLake shines for teams that are already happy with DuckDB and Parquet but now need versioning and safe concurrent writes. It suits side projects that start on a laptop and later move metadata to a small Postgres instance with no change in data layout. It also helps use cases that generate many tiny updates because it will not spray new manifest files on every commit.

Picture a small data team that spends most mornings waiting for yesterday’s Parquet drop to load before they can even start digging for insight. With DuckLake they wire the folder to a catalog, run a single INSERT job, and sit back while the database locks in the snapshot. They are free to query right away because the commit is atomic and nothing is half-written. Later that week an analyst breaks a dashboard by pushing a bad batch. Instead of rummaging through hundreds of files they add AT (VERSION => 42) to the query and watch the dashboard spring back to life, then compare the good snapshot with the bad one to see exactly what went wrong. A month later the same project grows beyond a single laptop. The team copies its Parquet files to S3, moves the catalog into a tiny Postgres instance, and keeps the exact same SQL scripts. No re-engineering marathon, just a quiet handoff from local to cloud.
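That cloud handoff boils down to one new ATTACH statement. A sketch only; the Postgres connection details and bucket name are placeholders to swap for your own:

INSTALL postgres;  -- Postgres catalog support
INSTALL httpfs;    -- S3 access
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=db.example.com' AS mylake
    (DATA_PATH 's3://my-bucket/parquet_folder/');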

Who DuckLake fits best

DuckLake feels most at home with people who love the freedom of open storage but still want the safety of a database. Think of the three-person analytics squad that cannot justify Snowflake fees, yet refuses to run pipelines that might corrupt yesterday’s numbers. It suits engineers fed up with manifest bloat in Iceberg when every micro batch creates more tiny files than actual data. It also clicks with notebook users who treat DuckDB as their pocket warehouse and now need proper version control and schema tweaks without extra tools. The only real ask is a bit of basic DBA discipline, the kind that keeps a Postgres backup script in cron and checks disk space once a week. If that sounds manageable, DuckLake will probably feel like a breath of fresh air.


Current gaps and the path forward

DuckLake sits at version 0.1, so expect rough edges. There is no automatic import from Iceberg yet. Glue, Trino, and most BI tools are unaware of it for now. The documentation warns that catalog backups are on you. The DuckDB team plans smoother migrations and more connectors over the next year. Early adopters will drive that roadmap.


DuckLake is a fresh take on the lakehouse idea. By trusting a SQL catalog, it removes the pain of scanning countless manifests and unlocks true transaction semantics. The price is that you must protect the catalog like any other database. If you want quick local experiments, a simple learning curve, and less metadata bloat, give it a try. Install the extension, attach a folder, and make a table. Send feedback to the DuckDB crew and help shape the roadmap. Happy versioning.