1.188 Version-Controlled & Branchable Databases#

Survey of version-controlled and branchable database tools implementing git semantics for data: Dolt (MySQL-compatible SQL with branches/commits), LakeFS (S3-compatible data lake versioning), Nessie (Iceberg catalog versioning), and Splitgraph. Primary focus on Dolt, used in Gas Town infrastructure as the Beads backing store. Covers Prolly Tree storage architecture, MySQL wire protocol compatibility, branch/merge semantics, DoltHub integration, LakeFS pipeline isolation patterns, and Nessie’s role in the Apache Iceberg ecosystem.



Domain Explainer: Why Would a Database Need Version Control?#

Topic: 1.188 — Version-Controlled & Branchable Databases
Audience: Developers and engineers new to this concept
Date: 2026-03-04


The Problem with Mutable Data#

Imagine you are working on a spreadsheet with a colleague. You both have a copy, and you are both making changes. At the end of the day, you need to combine your work. How do you know what changed? How do you avoid overwriting each other’s edits?

Now imagine that spreadsheet has a million rows, is being edited simultaneously by a dozen people, and you need to be able to answer the question “what did this spreadsheet look like three weeks ago?” at any time.

This is the problem that databases face, and it is a problem that most databases are not designed to solve.

A traditional database is built for one purpose: being current. When you update a row, the old version is gone. When you delete a record, it is gone. When a schema migration changes the shape of a table, the old shape is gone. This works well for transactional applications — when a customer places an order, you want the inventory to reflect that immediately, with no ambiguity. But it creates real problems for any use case where history matters.


The Analogy: Track Changes in a Document#

Microsoft Word has a feature called “Track Changes.” When it is enabled, every edit you make is recorded: deleted text is shown with a strikethrough, added text is highlighted, and the history of who changed what is preserved. You can accept or reject individual changes. You can compare two versions of the document side by side.

Version control for databases applies the same idea, but to an entire database — all of its tables, all of their rows — at once. Instead of tracking individual character changes in a document, it tracks row-level changes across tables. Instead of “author made this change at 2pm,” it records “user X ran this query and modified 47 rows at 2:14pm.”

The document analogy breaks down in one important way: documents are rarely worked on by more than a few people at once, and conflicts are resolved by humans reading the text. A database might be modified by hundreds of application threads simultaneously, and conflicts need to be resolved by the system itself.


How Git Solved This Problem for Code#

Software developers have been solving this problem for code since the 1970s, with the current dominant solution being git (created in 2005). Git’s approach has four key concepts:

Commits: A commit is a snapshot of the entire codebase at a specific moment in time. Every commit has a unique identifier (a hash), a message describing the change, and a reference to the previous commit. The history of commits forms an unbroken chain.

Branches: A branch is a pointer to a commit. Creating a branch is free and instant — it just creates a new pointer. Work on a branch is isolated from work on other branches. You can experiment freely without affecting anyone else’s code.

Diff: Given any two commits, you can see exactly what changed: which lines were added, which were deleted, in which files. This makes code review possible.

Merge: When two branches diverge and you want to combine them, git performs a merge. It looks at the common ancestor (where the branches split), the changes made in each branch, and combines them. If the same lines were changed differently in each branch, you have a conflict to resolve.

These four concepts — commits, branches, diff, merge — are what version-controlled database tools apply to data.


Why This Is Harder for Databases#

Applying git semantics to a database is significantly more difficult than applying them to text files. Here is why:

Structure matters: A text file is a sequence of lines. Changing line 42 is simple to represent. A database table is a collection of rows identified by primary keys. “Row with id=42 had its price column changed from 9.99 to 12.99” is a structured change that needs to be stored and compared in a structured way.

Scale matters: A typical git repository contains thousands of files with thousands of lines each. A typical database table contains millions to billions of rows. Naive approaches that scan entire tables to find what changed are too slow to be practical.

Transactions matter: Databases allow many operations to happen “at once” in a transaction. Version control needs to respect transaction boundaries — a commit should represent a consistent state, not a mid-transaction snapshot.

Schema changes matter: When a developer alters a table — adds a column, changes a data type, renames a column — this affects every row in the table. Merging schema changes from two branches is conceptually harder than merging text changes.

The tools in this space have each solved these challenges in different ways, with different trade-offs.


Three Categories of Solutions#

Row-Level Versioning (Example: Dolt)#

The deepest solution: the database’s storage engine is redesigned to record the history of every row. This requires replacing how the database stores data internally.

Dolt uses a data structure called a Prolly Tree — a variant of a B-tree (the traditional database storage structure) that is content-addressed. “Content-addressed” means each piece of storage is identified by a hash of its content, not by a memory address or file offset. Two identical pieces of data produce the same hash and are stored only once.

This makes diffing efficient: to find what changed between two database states, you compare their content hashes. If a subtree of the Prolly Tree has the same hash in both states, nothing in that subtree changed — skip it. Only descend into subtrees where the hash differs. For a small change in a large table, this is dramatically faster than scanning every row.

The result: you can ask questions like “show me every row that changed in the orders table between yesterday and today” and get an answer in seconds, even for a table with 100 million rows, because the Prolly Tree diff only visited the rows that actually changed.
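The skip-identical-subtrees idea can be sketched in a few lines. This is a toy model, not Dolt's implementation: a flat two-level "tree" whose chunk and root hashes stand in for Prolly Tree nodes, with `build_tree`, `diff`, and the row data all invented for illustration.

```python
import hashlib

def h(data: bytes) -> str:
    """Content address: identify a node purely by a hash of its bytes."""
    return hashlib.sha256(data).hexdigest()

def build_tree(rows, chunk_size=4):
    """Toy two-level content-addressed tree: leaves hold row chunks,
    and the root is a hash over the leaf hashes."""
    leaves = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    leaf_hashes = [h(repr(chunk).encode()) for chunk in leaves]
    root = h("".join(leaf_hashes).encode())
    return {"root": root, "leaf_hashes": leaf_hashes, "leaves": leaves}

def diff(a, b):
    """Compare two equally-chunked trees, doing row-level work only
    inside chunks whose hashes differ."""
    if a["root"] == b["root"]:
        return []                      # identical roots: nothing to scan
    changed = []
    for ha, hb, ca, cb in zip(a["leaf_hashes"], b["leaf_hashes"],
                              a["leaves"], b["leaves"]):
        if ha != hb:                   # only this chunk needs inspection
            changed.extend(r for r in cb if r not in ca)
    return changed

rows = [(i, "ok") for i in range(16)]
t1 = build_tree(rows)
rows2 = list(rows)
rows2[5] = (5, "changed")              # edit one row out of 16
t2 = build_tree(rows2)
print(diff(t1, t2))                    # → [(5, 'changed')]
```

Only one of the four chunks differs, so the diff touches four rows instead of sixteen; at database scale the same pruning happens recursively at every tree level.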

File-Level Versioning (Example: LakeFS)#

A shallower solution that operates at the level of files rather than rows. Many modern data systems store their data in files (Parquet, ORC, JSON) in object storage like Amazon S3. LakeFS sits in front of S3 and tracks the history of which files exist at which paths.

A LakeFS “branch” is a view of the file namespace: it remembers which files were at which paths when the branch was created, and records changes (new files, deleted files, updated files) as commits. Two branches can have different sets of files, and you can merge them (accept files from both branches into a combined view).

This approach does not know anything about the content of the files — a Parquet file is just a blob to LakeFS. But for data pipeline workflows where data is stored as files, this is exactly the right level of abstraction. The files are the unit of work.

Metadata/Catalog Versioning (Example: Nessie)#

The shallowest solution: versioning only the metadata that describes where the data is, not the data itself.

Apache Iceberg is a table format: Iceberg tables consist of data files (Parquet, ORC) plus metadata files that describe which data files belong to the table and what their contents look like. A “catalog” is a service that maps table names to their current metadata file.

Nessie versions the catalog. A Nessie commit records “at this point in time, table customers had metadata file X, table orders had metadata file Y.” Creating a branch is just creating a new catalog state pointer. The actual data files are not touched at all.

This makes Nessie operations extremely lightweight — branching an Iceberg database with terabytes of data costs nothing in storage or compute, because only the catalog pointers are versioned. The trade-off: Nessie cannot tell you what rows changed between two versions, only which metadata files changed.


When Is This Worth Using?#

Version control for databases is worth the added complexity when:

Reproducibility matters: If you need to reproduce an exact computation from three months ago — a model training run, a regulatory report, a financial reconciliation — and the underlying data may have changed, you need to be able to retrieve the data as it existed at the time.

Experimentation is needed: If you want to try a schema change, a data transformation, or a batch update without risking the production state, branching gives you a safe sandbox that can be discarded without affecting anyone else.

Collaboration produces conflicts: If multiple people or systems are writing to the same data store and their changes could conflict, merge semantics give you a structured way to combine their work with explicit conflict detection.

Auditing is required: If you need to answer “who changed this record, and when, and what was the value before?” — whether for compliance, debugging, or operations — built-in version history is far more reliable than manually implemented audit logging.


When It Is NOT Worth Using#

Switching to a version-controlled database is not always the right answer:

High-throughput OLTP: If you are running a high-volume transaction processing system (payment processing, order management at scale), the overhead of version control may be unacceptable. Traditional databases are optimized for maximum throughput; version-controlled databases trade some throughput for history.

You just need operational PITR: Every major database supports point-in-time recovery (PITR) for operational purposes — restoring the database to its state before a hardware failure. This is different from “I want to query what the data looked like at an arbitrary past time.” If you only need operational recovery, you don’t need a version-controlled database.

The team is not ready for the conceptual overhead: Branches, commits, merges, and conflicts are powerful concepts, but they add complexity. A team that has not internalized git’s mental model for code will face a steeper learning curve applying it to data. The benefit needs to outweigh the training and operational overhead.

Your data is truly append-only: Some datasets only ever grow — new rows are added, old rows are never modified. In this case, a timestamp column gives you equivalent history without the overhead of a version-controlled storage engine.


The Core Mental Model#

If you take away one thing from this explainer: a version-controlled database is to data what git is to source code. It adds four capabilities that plain databases lack:

  1. A permanent, addressable history of every change.
  2. The ability to diverge from the main state (branch) and work in isolation.
  3. The ability to see exactly what changed between any two states (diff).
  4. The ability to combine diverged states back together (merge).

These capabilities solve real problems in data engineering, machine learning, compliance, and collaborative data work. The tools implementing them are young compared to git (which has 20 years of maturation), but they are mature enough for production use in many scenarios.

The decision of whether to adopt one is fundamentally the same question as deciding whether a project needs git: if your data changes, matters, and is worked on by more than one person or system — it probably does.


S1: Rapid Discovery — Version-Controlled & Branchable Databases#

Topic: 1.188 — Version-Controlled & Branchable Databases
Stage: S1 Rapid Discovery
Date: 2026-03-04


Quick Answer#

If you need a SQL database with git-style branching, use Dolt. It is the only tool in this space that gives you a MySQL-compatible SQL interface with full branch/commit/merge/diff semantics on relational table data. If you are versioning files in object storage (S3/GCS), use LakeFS. If you are managing Iceberg or Delta Lake table metadata, use Nessie. Splitgraph is technically interesting but has stalled in community adoption.

For most teams encountering this problem space for the first time, Dolt is the right starting point: it behaves like MySQL, installs as a single binary, and adds version control without requiring infrastructure changes beyond swapping your database server.


What Problem Does This Space Solve?#

Databases are, by design, stateful and mutable. You write a row; the old version is gone. You run a migration; the schema before it is gone. This mutability is fine for serving applications, but it creates serious problems for:

  • Data pipelines: You cannot reproduce last month’s model training run because the source table has changed.
  • Schema migrations: You cannot cheaply experiment with a schema change in isolation before committing it to production.
  • Auditing: You cannot answer “what did this table look like on January 15th?” without bespoke audit logging.
  • Collaboration: Two data engineers cannot work on transformations in isolation and then merge their work.

Version control for code solved these problems for source files: git gives you branches, commits, diffs, and merges. “Git for data” tools apply the same semantics to stored data. The challenge is that databases are not flat files, so the implementation is significantly more complex than git’s content-addressed blob store.


The Four Main Contenders#

Dolt#

GitHub: dolthub/dolt | Stars: ~18,000 | Language: Go | License: Apache 2.0

Dolt is the flagship implementation of git-for-databases. It presents itself as a drop-in MySQL replacement: connect with any MySQL client, use standard SQL, and additionally get dolt_branch(), dolt_commit(), dolt_diff(), and related SQL procedures. The storage engine is built on Prolly Trees, a content-addressed B-tree variant that makes diffing and merging efficient without scanning entire tables.

Dolt’s community position is strong. It has the most GitHub stars in the space, an active blog with benchmarks, and a SaaS hosting platform (DoltHub) that functions like GitHub for databases. The team is well-funded (DoltHub Inc.) and the project has shipped consistently for several years.

Trade-offs: Dolt is approximately 2x slower than vanilla MySQL on typical workloads due to the overhead of the Prolly Tree storage format. This is acceptable for most use cases where you are not running at MySQL’s performance ceiling. Dolt is not a drop-in for PostgreSQL — it uses the MySQL wire protocol, which is a meaningful constraint.

Community consensus: The go-to tool for teams that want version control semantics on relational data. Actively recommended in data engineering and ML communities.

LakeFS#

GitHub: treeverse/lakeFS | Stars: ~4,500 | Language: Go | License: Apache 2.0 (core) / commercial tiers

LakeFS does not touch SQL at all. It sits in front of object storage (S3, GCS, Azure Blob) and implements git-like branching over the objects stored there. A “branch” in LakeFS is a pointer to a snapshot of the object namespace; writes to a branch are isolated until merged. It exposes an S3-compatible API, so tools that already write to S3 (Spark, Flink, dbt) can be pointed at LakeFS with minimal changes.

Trade-offs: LakeFS is powerful for data lake workflows but does not help with SQL-level operations. You are versioning files, not rows. Schema changes within a Parquet file are invisible to LakeFS. The commercial/open boundary has shifted over time, which has caused some community wariness about lock-in risk.

Community consensus: Well-regarded in the data lake / lakehouse space. Preferred by teams already operating on object storage at scale. Not relevant to SQL database use cases.

Nessie#

GitHub: projectnessie/nessie | Stars: ~900 | Language: Java | License: Apache 2.0

Nessie is a catalog service for Apache Iceberg and Delta Lake tables. Where LakeFS versions objects, Nessie versions the metadata that describes where those objects are. A Nessie “commit” records a new snapshot of the Iceberg table metadata pointers. This makes it exceptionally lightweight: branching and merging only manipulates catalog records, not data files.

Trade-offs: Nessie requires an Iceberg or Delta Lake ecosystem to be useful. It does not stand alone. It is primarily a Dremio project, which raises questions about long-term stewardship neutrality. Community activity is moderate; real-world adoption is growing as Iceberg becomes mainstream.

Community consensus: Increasingly seen as the right answer for Iceberg-native data platforms. Less known outside the Spark/Dremio/Flink ecosystem.

Splitgraph#

GitHub: splitgraph/splitgraph | Stars: ~1,400 | Language: Python | License: Apache 2.0

Splitgraph layers versioned dataset semantics on PostgreSQL using a system of “images” (snapshots) and “layers” (diffs). It also offers a data sharing platform. Development has become notably less active since 2022, and the team appears to have pivoted focus. The PostgreSQL-native approach is conceptually appealing but lacks the engineering investment that Dolt has received.

Trade-offs: PostgreSQL compatibility is a genuine advantage over Dolt if your stack is Postgres-first. However, the project’s reduced activity is a meaningful risk. The data sharing platform (data.splitgraph.com) continues to operate but has not meaningfully grown.

Community consensus: Interesting but stalled. Not recommended for new projects unless the PostgreSQL requirement is hard and the team is prepared to maintain the tooling themselves.

XetHub#

GitHub: XetHub (acquired by Hugging Face) | Stars: ~500 (pre-acquisition)

XetHub extended git to handle large binary files and datasets, integrating with the standard git CLI rather than replacing it. After its acquisition by Hugging Face in 2024, the technology was folded into Hugging Face’s dataset hosting infrastructure. As a standalone tool it is effectively discontinued, though the ideas live on in Hugging Face’s platform.

Not recommended as a primary tool. Mentioned for completeness and because it represents an alternative design approach (git extension vs. new database).


Key Differentiator: What Are You Versioning?#

The most important selection criterion is the data model:

| Tool | What It Versions | Query Interface | Best For |
| --- | --- | --- | --- |
| Dolt | Relational table rows | MySQL SQL | SQL databases, config/state stores |
| LakeFS | Object storage files | S3 API | Data lakes, Parquet/ORC files |
| Nessie | Iceberg/Delta metadata | REST + catalog API | Lakehouse platforms |
| Splitgraph | PostgreSQL rows | PostgreSQL SQL | PostgreSQL (if you accept the risk) |

Community Health Summary#

Dolt is the clear leader in community engagement in this space: active GitHub issues, a maintained changelog, a public benchmark suite, and a team that writes technical blog posts about their implementation choices. DoltHub.com provides a discovery layer for public datasets hosted on Dolt.

LakeFS has a solid community, particularly in the data engineering / DataOps space. They have a Slack community and regular releases.

Nessie is primarily driven by Dremio contributors. External contribution is limited but growing as Iceberg adoption rises.

Splitgraph community activity has declined sharply. PRs go unreviewed; issues accumulate.


When NOT to Use Any of These#

  • You just need audit logging: A trigger-based audit table, or system-versioned temporal tables where the database supports them (SQL Server’s FOR SYSTEM_TIME AS OF, CockroachDB’s AS OF SYSTEM TIME), may be simpler than switching databases.
  • You need point-in-time recovery: All major databases already support this via WAL/binlog. You don’t need a git-for-databases tool for operational PITR.
  • You are at extreme scale: None of these tools are designed for petabyte-scale OLTP. LakeFS comes closest to scale at the object storage tier.
  • You need sub-millisecond latency: Dolt’s overhead over MySQL is real. At the high-performance end, use vanilla MySQL/Postgres.

Bottom Line for Practitioners#

For most practitioners encountering this space: start with Dolt if your data is relational. It is the most mature, best-documented, and most actively developed option. The MySQL compatibility means your existing tooling works. The git semantics are implemented completely enough that branch/merge workflows actually function in production.

If you operate a data lake on object storage and want reproducibility for your Spark pipelines, LakeFS is the right choice. If you have committed to Apache Iceberg, Nessie is worth evaluating alongside or instead of LakeFS (they solve adjacent but distinct problems).

Avoid Splitgraph for new projects.


S2: Comprehensive Discovery — Version-Controlled & Branchable Databases#

Topic: 1.188 — Version-Controlled & Branchable Databases
Stage: S2 Comprehensive Discovery
Date: 2026-03-04


Overview#

This stage dives deep into the implementation architecture of the four primary tools in the version-controlled database space. The heaviest coverage is on Dolt, which is used in Gas Town infrastructure as the backing store for the Beads issue-tracking system. Understanding Dolt’s internals is directly relevant to operating and debugging that system.


Dolt: Deep Technical Dive#

The Core Innovation: Prolly Trees#

Dolt’s defining technical contribution is the Prolly Tree (Probabilistic B-tree), the storage structure that makes efficient diffing and merging of large relational datasets practical.

A conventional B-tree organizes keys into a balanced tree optimized for range scans and point lookups. The tree structure is determined by insertion order and rebalancing decisions. Two B-trees with identical data but different insertion histories can have completely different internal structures. This makes it impossible to diff two B-trees without scanning every leaf node — an O(n) operation for any comparison.

Prolly Trees solve this by making the tree structure content-determined. The boundaries between tree chunks are chosen based on the data content using a rolling hash function (similar to how rsync finds chunk boundaries). This means two trees with mostly identical data will have mostly identical internal structure. The diff algorithm can walk both trees simultaneously, skipping subtrees that hash identically and descending only into subtrees that differ. For a small change to a large table, this produces near-O(changed rows) diff performance rather than O(total rows).

Content-addressing extends to every node in the tree. Each chunk is stored with a hash of its content, and parent nodes reference children by hash. This is structurally similar to git’s DAG of content-addressed blobs, trees, and commits. When you create a Dolt branch, you are creating a new pointer (a ref, in git terms) that initially points to the same root hash as the current commit. Writing to the branch produces new chunks only for the changed data; unchanged chunks are shared.

This architecture means:

  • Branching is O(1): Just create a new ref pointing to the current root hash.
  • Storage is deduplicated: Identical data across branches shares storage.
  • Diff is sub-linear: Proportional to changed data, not total data.
  • Merge is possible: The three-way merge algorithm works on the tree structure directly.
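A toy illustration of content-determined chunk boundaries (not Dolt's actual rolling-hash algorithm, and all names invented): each key's own hash decides whether it ends a chunk, so boundaries depend on values rather than positions, and an insertion disturbs at most the chunk it lands in.

```python
import hashlib

def chunk_keys(keys, mask=0x3):
    """Content-defined chunking: a key ends a chunk when its hash
    matches a bit pattern, so boundaries are value-determined."""
    chunks, current = [], []
    for k in keys:
        current.append(k)
        digest = hashlib.sha256(str(k).encode()).digest()
        if digest[0] & mask == 0:      # roughly 1-in-4 keys close a chunk
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks

base = chunk_keys(range(100))
shifted = chunk_keys([-1] + list(range(100)))  # insert one key at the front

# With positional (fixed-size) chunking, every chunk after the insertion
# point would shift and re-hash; with content-defined boundaries, at most
# one chunk changes and all the others are byte-identical (and therefore
# deduplicated by the content-addressed store).
print(len(set(base) & set(shifted)), "of", len(base), "chunks reused")
```

This boundary stability is what lets two mostly-identical Dolt tables share mostly-identical tree structure, which in turn is what makes the hash-comparison diff effective.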

Dolt’s Commit Graph#

Dolt maintains a commit graph isomorphic to git’s. Each commit object records:

  • A pointer to the root of the Prolly Tree at that commit (the “table set”)
  • A list of parent commit hashes (one for regular commits, two for merges)
  • Author, committer, timestamp, and message metadata
  • A unique commit hash (SHA-256 based)

Branches are named refs pointing to commit hashes, stored in a ref store (backed by the same content-addressed chunk store). Tags are immutable refs. The HEAD ref tracks the current working branch.

The working set (uncommitted changes) is tracked in a structure separate from the commit graph: it records the state of all tables as they exist before a dolt commit, analogous to git’s index (staging area) plus working directory.
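A minimal sketch of this commit model, with hypothetical names and SHA-256 over a JSON encoding standing in for Dolt's actual serialization: a commit is a content-addressed object recording a root pointer, parent hashes, and metadata, and a branch is just a named ref that advances on commit.

```python
import hashlib
import json

def commit(store, refs, branch, root_hash, message, author="dev"):
    """A commit is itself content-addressed: its hash covers the
    table-set root, its parent list, and its metadata."""
    parent = refs.get(branch)
    obj = {"root": root_hash,
           "parents": [parent] if parent else [],
           "author": author,
           "message": message}
    commit_hash = hashlib.sha256(
        json.dumps(obj, sort_keys=True).encode()).hexdigest()
    store[commit_hash] = obj
    refs[branch] = commit_hash          # advance the branch ref
    return commit_hash

store, refs = {}, {}
c1 = commit(store, refs, "main", "root-aaa", "initial import")
refs["feature-x"] = refs["main"]        # branching is O(1): copy one pointer
c2 = commit(store, refs, "feature-x", "root-bbb", "experiment")

print(store[c2]["parents"] == [c1])     # → True: feature-x descends from main
```

Note that creating the branch copied nothing but a pointer, and committing to it left `main`'s ref untouched, which is exactly the isolation property described above.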

MySQL Wire Protocol Compatibility#

Dolt implements the MySQL wire protocol, which means any MySQL client — mysql CLI, MySQL Workbench, MySQL Connector/J, Python’s mysql-connector-python, Go’s go-sql-driver/mysql — can connect to a Dolt server without modification. The server listens on port 3306 by default (configurable).

Compatibility is not 100%. As of early 2026, Dolt supports:

  • Full DML: SELECT, INSERT, UPDATE, DELETE, REPLACE
  • Most DDL: CREATE TABLE, ALTER TABLE, DROP TABLE, indexes, foreign keys
  • Stored procedures and functions (partial; complex stored procedures may fail)
  • Views
  • Transactions (BEGIN, COMMIT, ROLLBACK) — with the caveat that Dolt transactions map to Dolt working set state, not the traditional MVCC isolation model

Dolt-specific functionality is exposed through SQL stored procedures and system tables:

  • CALL dolt_branch('feature-x') — create a branch
  • CALL dolt_checkout('feature-x') — switch working branch
  • CALL dolt_commit('-m', 'message') — commit working changes
  • CALL dolt_merge('feature-x') — merge a branch
  • SELECT * FROM dolt_log — view commit history
  • SELECT * FROM dolt_diff_<tablename> — view row-level diffs between commits

This means version control operations are SQL operations. There is no separate CLI required for an application using Dolt as a library or server; all branching and versioning happens through the standard MySQL connection.

The Dolt CLI#

In addition to the SQL server mode, Dolt ships a git-like CLI:

dolt init
dolt checkout -b feature-x
dolt sql -q "INSERT INTO mytable VALUES (1, 'hello')"
dolt add mytable
dolt commit -m "add hello row"
dolt diff main
dolt merge feature-x

The CLI commands are intentionally parallel to git. This makes the learning curve shallow for developers already familiar with git. The CLI operates on a local Dolt database directory (structured similarly to a git repo — a .dolt/ directory in the working directory).

Branch and Merge Semantics#

Dolt branch/merge operates at the row level. A three-way merge computes:

  1. The common ancestor commit (the merge base)
  2. The diff between the merge base and the current branch (ours)
  3. The diff between the merge base and the incoming branch (theirs)

Row-level conflicts occur when the same primary key was modified differently in both branches. Schema conflicts occur when ALTER TABLE statements were applied differently. Dolt surfaces conflicts in the dolt_conflicts_<tablename> system table, where they can be inspected and resolved via SQL UPDATE or DELETE statements, followed by CALL dolt_resolve_all_conflicts() or manual resolution.

This is conceptually identical to git’s three-way merge, applied to structured data instead of text files. The practical difference is that row-level merges are usually cleaner than text merges (there is no “hunk” concept — either two edits touch the same primary key or they don’t).
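The row-keyed three-way merge can be sketched directly. This is a toy over Python dicts keyed by primary key, not Dolt's implementation; conflicted keys keep the "ours" value pending resolution, loosely mirroring how Dolt surfaces conflicts in a side table rather than blocking the merge.

```python
def three_way_merge(base, ours, theirs):
    """Row-level three-way merge over tables keyed by primary key.
    A conflict is the same key changed differently on both sides."""
    merged, conflicts = dict(ours), {}
    for k in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(k), ours.get(k), theirs.get(k)
        if o == t:                 # same result on both sides, or untouched
            continue
        elif o == b:               # only theirs changed this key: take theirs
            if t is None:
                merged.pop(k, None)
            else:
                merged[k] = t
        elif t == b:               # only ours changed: already in merged
            continue
        else:                      # both sides changed the same key differently
            conflicts[k] = {"base": b, "ours": o, "theirs": t}
    return merged, conflicts

base   = {1: "alpha", 2: "beta", 3: "gamma"}
ours   = {1: "alpha", 2: "BETA", 3: "gamma"}           # we edited key 2
theirs = {1: "alpha", 2: "beta2", 3: "gamma", 4: "d"}  # they edited 2, added 4

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)      # key 4 merged in cleanly
print(conflicts)   # key 2 conflicts: changed differently on both sides
```

The clean case (key 4, added on one side only) merges silently; the conflict case (key 2, edited on both sides) is reported for explicit resolution, just as the dolt_conflicts tables make it queryable via SQL.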

Performance vs. Vanilla MySQL#

DoltHub publishes benchmarks comparing Dolt to MySQL on standard benchmarks (sysbench). The overhead has improved significantly over Dolt’s development history. Current figures (as of early 2026) show Dolt running at approximately 40-60% of MySQL’s speed on write-heavy workloads and 60-80% on read-heavy workloads.

The overhead sources are:

  1. Prolly Tree write amplification: Writing a row means updating the chunk containing that row, then updating all parent chunks up to the root. This is more expensive per write than a conventional B-tree update.
  2. Content hashing: Every chunk write requires computing a hash.
  3. Working set tracking: Dolt must maintain the uncommitted working set separately from the committed state.

For the Gas Town Beads system, this overhead is irrelevant: the Beads database handles tens to hundreds of operations per day, nowhere near the performance ceiling. The version control benefits far outweigh the performance cost at this workload level.

DoltHub Integration#

DoltHub (dolthub.com) is a SaaS platform for hosting Dolt databases, functioning as GitHub for data. Features include:

  • Push/pull of Dolt databases (analogous to git push origin main)
  • Web UI for browsing tables, commit history, diffs, and branches
  • Pull request workflow for data changes
  • Public dataset hosting (open data sharing)
  • Access controls for private databases

A Dolt database can be pushed to DoltHub and pulled by any authorized user, enabling a full collaborative workflow. This is used in practice for open datasets where DoltHub hosts the canonical version and contributors submit data pull requests.

Go Library Usage#

Dolt is written in Go and exposes a Go library (github.com/dolthub/dolt/go/libraries/doltcore). This library provides programmatic access to Dolt databases without going through the SQL server: a Go application can embed Dolt, opening a database directory directly to manipulate tables, create commits, and read diffs without network overhead.

The Beads system in Gas Town infrastructure uses Dolt in server mode (connecting via the MySQL wire protocol), not the Go library directly. This is the more common pattern for polyglot environments.


LakeFS: Technical Architecture#

Design Philosophy#

LakeFS is built around a single insight: for data lake workflows, the unit of versioning should be the namespace of objects in object storage, not the objects themselves. A “commit” in LakeFS is a snapshot of the logical namespace — which objects exist, under what keys — without copying the objects.

Storage Architecture#

LakeFS maintains its own metadata store (backed by PostgreSQL, DynamoDB, or an embedded KV store depending on deployment) that maps the logical namespace at each commit. Actual data files live in object storage as-is; LakeFS never copies them. When you write a file to a LakeFS branch, the file goes to a staging area in object storage, and the LakeFS metadata records “this key now points to this object in the staging area.” On commit, the staging area is promoted and the metadata updated atomically.

Branches in LakeFS are cheap: creating a branch just creates a new metadata pointer at the current commit. No data is copied. Two branches that share 10 TB of Parquet files cost nothing extra in storage — the files exist once and both branches reference them.

Merging in LakeFS merges the object namespace: keys modified in the source branch that were not modified in the target branch are applied. Conflicts arise when the same key was modified differently in both branches; resolution is currently limited (delete one or the other; there is no content-level merge of data files).

S3-Compatible API#

LakeFS exposes an S3-compatible REST API. Tools that can write to S3 can be pointed at LakeFS by changing the endpoint URL and bucket naming convention. A LakeFS path looks like s3://my-repo/main/path/to/file.parquet where my-repo is the LakeFS repository and main is the branch.

This S3 compatibility is the key to LakeFS’s adoption: existing Spark jobs, dbt runs, and Airflow pipelines can be pointed at LakeFS instead of raw S3 with minimal changes. The branch becomes the isolation unit for a pipeline run.

Data Pipeline Isolation Pattern#

The canonical LakeFS usage pattern for reproducible pipelines:

  1. Create a branch from main for the pipeline run.
  2. Run the pipeline, writing output to the branch.
  3. Validate outputs.
  4. If validation passes, merge the branch to main.
  5. If validation fails, discard the branch.

This gives pipelines atomic, isolated writes. The main branch always contains validated data; failed pipeline runs are discarded without polluting the main namespace. This is fundamentally the same workflow as feature branches in git.
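The five steps above can be sketched as a small orchestration function. The branch and merge operations are passed in as callables so the same logic works against any LakeFS client; all names here are hypothetical, not a real LakeFS SDK.

```python
# Minimal sketch of the branch-per-run isolation pattern (steps 1-5 above).
# The callables stand in for whatever LakeFS client the pipeline uses.

def run_isolated(create_branch, run_pipeline, validate, merge, delete_branch,
                 run_id: str, base: str = "main") -> bool:
    branch = f"run-{run_id}"
    create_branch(branch, source=base)   # 1. branch from main
    run_pipeline(branch)                 # 2. write output to the branch
    if validate(branch):                 # 3. validate outputs
        merge(branch, into=base)         # 4. promote validated data atomically
        return True
    delete_branch(branch)                # 5. discard the failed run
    return False
```

The main branch only ever advances via the merge in step 4, so readers of main never observe a partially written run.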

Commit Model#

A LakeFS commit records:

  • A snapshot of the namespace at commit time
  • Parent commit reference(s)
  • Commit message and metadata
  • A unique commit ID

The commit history is a DAG, same as git. git log-equivalent queries return the list of commits on a branch. Point-in-time reads work by specifying a commit ID instead of a branch name in the S3 path.

LakeFS vs. Dolt#

LakeFS and Dolt are not competitors — they solve different problems at different layers. LakeFS versions the files; Dolt versions the rows. A team could theoretically use both: LakeFS to version Parquet files in a data lake, and Dolt as the SQL query layer for a specific relational dataset. In practice, teams choose one or the other based on whether their primary data format is files (LakeFS) or SQL tables (Dolt).


Nessie: Technical Architecture#

Catalog Versioning for Iceberg#

Apache Iceberg defines a table format: tables are described by metadata files that point to data files (Parquet, ORC, Avro). The catalog is the service that maps table names to metadata file locations. Different catalog implementations exist: Hive Metastore, AWS Glue, and — for versioned workflows — Nessie.

Nessie versions the catalog: each commit records the state of the entire catalog (which tables exist, what their current metadata pointer is). Branching and merging operate on this catalog state.

Architecture#

Nessie is a REST API server. It exposes endpoints for:

  • Creating and listing branches and tags
  • Committing catalog changes (new table pointer, dropped table, etc.)
  • Reading the catalog state at any branch or commit
  • Merging branches

The Nessie server itself is stateless in the request-handling layer; state is persisted in a configurable backend (JDBC-compatible databases like PostgreSQL, in-memory for testing, or DynamoDB). The server is a Java application distributed as a Docker image or JAR.

Iceberg Integration#

Iceberg’s catalog abstraction supports pluggable catalog implementations. Pointing an Iceberg client at Nessie requires configuring the catalog type to nessie and providing the Nessie server URL. After that, all Iceberg operations (creating tables, running queries, running Spark jobs) go through Nessie’s versioned catalog.

A Spark job writing to an Iceberg table via Nessie:

  1. Creates a branch in Nessie.
  2. Writes data via Spark, which updates the Iceberg table metadata.
  3. Nessie records the new metadata pointer as a commit on the branch.
  4. After validation, the branch is merged to main.
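The Spark-side wiring for step 2 is a handful of catalog properties. This sketch follows the configuration keys documented for the Iceberg/Nessie integration, but the key names, URL, port, and branch name should all be treated as assumptions to verify against your Iceberg and Nessie versions.

```python
# Hedged sketch of Spark catalog settings for writing to Iceberg via Nessie.
# Key names follow the Iceberg/Nessie integration docs; values are illustrative.

nessie_catalog = "nessie"  # arbitrary catalog name chosen by the user
spark_conf = {
    f"spark.sql.catalog.{nessie_catalog}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{nessie_catalog}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    f"spark.sql.catalog.{nessie_catalog}.uri": "http://nessie.example.com:19120/api/v1",
    f"spark.sql.catalog.{nessie_catalog}.ref": "etl-run-2026-03-04",  # the Nessie branch to write to
    f"spark.sql.catalog.{nessie_catalog}.warehouse": "s3://my-warehouse/",
}
# Each pair would be applied via SparkSession.builder.config(k, v);
# queries then address tables as nessie.<namespace>.<table>.
```

Switching the `ref` property is all it takes to point the same job at a different branch, which is what makes the branch-per-run workflow cheap on the Spark side.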

Commit Model and Conflict Detection#

Nessie commits are lightweight: they only record metadata pointer changes, not data. A table with 1 TB of Parquet data produces a Nessie commit that is a few hundred bytes (the new metadata file pointer). This makes Nessie branches extremely cheap.

Conflict detection is at the table level: if two branches modify the same table (update its metadata pointer), Nessie detects a conflict on merge. Resolution is manual (accept one branch’s version of the table). There is no sub-table merge (row-level conflicts are invisible to Nessie, which only sees metadata pointers).
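Because Nessie only sees table-name-to-metadata-pointer mappings, its conflict rule can be illustrated in a few lines: a conflict is any table whose pointer changed on both branches since the common base, and not to the same value. This is an illustrative sketch, not Nessie's implementation.

```python
# Toy model of Nessie-style table-level conflict detection.
# A branch state is a mapping of table name -> metadata-file pointer.

def detect_conflicts(base: dict, ours: dict, theirs: dict) -> set:
    tables = set(base) | set(ours) | set(theirs)
    changed_ours = {t for t in tables if base.get(t) != ours.get(t)}
    changed_theirs = {t for t in tables if base.get(t) != theirs.get(t)}
    # Conflict: both sides moved the pointer, and to different values.
    return {t for t in changed_ours & changed_theirs if ours.get(t) != theirs.get(t)}
```

Note that row-level edits inside a table are invisible here: two branches that each touched different rows of the same table still conflict, because both moved the table's metadata pointer.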

Nessie vs. LakeFS#

Nessie and LakeFS are often discussed together but are architecturally distinct:

  • LakeFS versions the object namespace (the files themselves, at the S3 level).
  • Nessie versions the catalog (the metadata that describes what the tables are).

In a lakehouse architecture, you could run both: Nessie versions which metadata file describes table X at a given commit, and the metadata file itself describes which Parquet files contain the data. LakeFS would version the Parquet files. In practice, most teams choose one versioning layer.

Nessie is the better choice when you are already committed to the Iceberg table format and want versioning at the semantic (table) level. LakeFS is better when you want versioning at the file level without requiring Iceberg.


Splitgraph: Technical Overview#

Architecture#

Splitgraph wraps PostgreSQL with a versioning layer implemented as a PostgreSQL extension plus a Python client. Data is stored in PostgreSQL tables; Splitgraph tracks changes using audit triggers and records “images” (snapshots) of table contents. Images can be exported as “layers” (diffs from a previous image), which can be pushed to a registry and pulled by other users.

Layered Storage#

Splitgraph’s storage model uses “object” chunks: a table image is decomposed into chunks, each stored as a PostgreSQL object. Objects are content-addressed. An image records a set of object references for each table. This is conceptually similar to Dolt’s Prolly Tree approach but implemented at a higher level (PostgreSQL objects rather than a custom storage engine).

Why It Has Stalled#

Splitgraph’s technical approach requires instrumenting PostgreSQL with audit triggers and routing workflows through its tooling, which adds overhead and complexity. The business model pivoted to data cataloging and sharing (data.splitgraph.com), diverting engineering attention from the core versioning engine. The result is a technically interesting system that lacks the polish and active development of Dolt.


Cross-Cutting Concerns#

Three-Way Merge Complexity#

All of these tools implement some variant of three-way merge. The challenge is that structured data has more merge semantics than text:

  • Renamed columns: Was column A renamed to B on one branch and deleted on another? Dolt detects schema conflicts; LakeFS/Nessie are oblivious (they don’t look inside files).
  • Referential integrity: Merging foreign-key-constrained tables can produce temporarily invalid states. Dolt handles this within its constraint engine.
  • Type changes: Changing a column from INTEGER to VARCHAR on one branch is incompatible with queries on another branch that assume INTEGER.

Dolt addresses these as part of its SQL engine. LakeFS and Nessie are fundamentally file/metadata versioners and are deliberately agnostic to content.
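A toy three-way merge over keyed rows makes the extra semantics concrete: unlike text merges, a modify-versus-delete on the same primary key is a conflict that only a row-aware merger can even see. This hypothetical helper illustrates the rule; it is not Dolt's actual algorithm.

```python
# Toy three-way merge at the row level (rows keyed by primary key).
# base: common ancestor; ours/theirs: the two branch states.

def merge_rows(base: dict, ours: dict, theirs: dict):
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:            # both sides agree (same change, or no change)
            if o is not None:
                merged[key] = o
        elif o == b:          # only theirs changed (modify or delete)
            if t is not None:
                merged[key] = t
        elif t == b:          # only ours changed
            if o is not None:
                merged[key] = o
        else:                 # both changed the row differently: conflict
            conflicts.append(key)
    return merged, conflicts
```

Schema-level conflicts (renames, type changes) sit a layer above this: they change what "the same row" even means, which is why Dolt must resolve schema merges before row merges.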

Operational Deployment Patterns#

Dolt can be run as a standalone server (dolt sql-server) or embedded via the Go library. Production deployments typically run Dolt in the same way you would run MySQL: a persistent process, connecting client applications via the MySQL protocol.

LakeFS is deployed as a Docker container or Kubernetes pod alongside your object storage. It requires a metadata store (typically PostgreSQL). The S3-compatible gateway is the primary interface for clients.

Nessie is deployed as a Java service, typically in Docker or Kubernetes. It requires a backend database and is usually deployed as a sidecar to a Spark or Dremio cluster.

Security Model#

All three tools inherit the security model of their underlying system:

  • Dolt: MySQL user/privilege model. Users are created with CREATE USER and privileges with GRANT. TLS for wire encryption.
  • LakeFS: IAM-based access control with its own RBAC layer on top of object storage permissions.
  • Nessie: Basic/bearer token auth; more advanced IAM integration varies by deployment.

S3: Need-Driven Discovery — Version-Controlled & Branchable Databases#

Topic: 1.188 — Version-Controlled & Branchable Databases Stage: S3 Need-Driven Discovery Date: 2026-03-04


Overview#

This stage maps tools in the version-controlled database space to concrete personas and their real-world needs. The goal is to answer “who uses this and why?” rather than “what does it technically do?” Understanding the persona-tool fit is essential for making selection decisions in context.


Persona 1: Data Engineer Wanting Reproducible Pipelines#

Context#

A data engineer runs daily pipeline jobs that read from source tables, transform the data, and write to output tables used by analysts and ML systems. Their recurring pain: a model trained last month cannot be reproduced because the source data has changed. A dashboard that worked last week now shows different numbers. A failed pipeline job partially overwrote the output table before crashing, leaving it in an inconsistent state.

The engineer has tried approaches like writing timestamped snapshots to a separate table, but this creates storage bloat and makes query patterns awkward. They have also tried keeping raw data immutable in S3 and re-running transformations, but transformation logic also changes and the re-run does not reproduce the original output.

What They Actually Need#

  1. Atomic pipeline commits: A pipeline either succeeds and its entire output is committed, or it fails and nothing changes. No partial writes.
  2. Point-in-time reads: The ability to query “what did the output table look like when this model was trained?” by referencing a commit hash.
  3. Pipeline isolation: Two concurrently running pipelines write to isolated branches; validated outputs are merged to main.
  4. Audit trail: A log of what ran, when, and what changed.

Tool Fit#

LakeFS is the strongest fit if the pipeline works with files (Parquet, ORC, JSON) in object storage. The branch-per-pipeline-run pattern maps exactly to their needs: create a branch, run the pipeline to the branch, validate, merge or discard. The S3 compatibility means the pipeline code changes minimally.

Nessie is the right addition if the pipeline uses Apache Iceberg tables. Nessie provides atomic table-level commits and branching at the catalog layer, so a Spark job writing to Iceberg via Nessie gets equivalent isolation to LakeFS but at the table-metadata level.

Dolt applies if the pipeline output is a SQL table and the team wants to query historical states with SQL rather than by navigating file snapshots. SELECT * FROM mytable AS OF 'commit-hash' is a compelling interface for analytical investigations.

Friction Points#

Data engineers operating at scale (petabytes of data, thousands of pipeline runs per day) will find LakeFS more production-hardened than Dolt for this use case. Dolt’s Prolly Tree structure adds meaningful overhead per row write, which accumulates at high volume. LakeFS operations are proportional to the number of objects, not the data volume within objects.


Persona 2: ML Engineer Needing Dataset Versioning#

Context#

An ML engineer maintains training datasets that evolve over time: new examples are added, labels are corrected, problematic samples are filtered out. They need to train models on specific dataset versions, compare model performance across dataset versions, and reproduce exact training runs months later for auditing.

They have experienced the “dataset version mismatch” problem: a model’s performance degrades in production, and investigation reveals the training dataset was updated between the baseline and production versions — making root-cause analysis difficult.

What They Actually Need#

  1. Named dataset versions: The ability to tag a dataset state (“dataset-v1.2”) and retrieve it exactly.
  2. Diff between versions: “What changed between dataset-v1.1 and dataset-v1.2?” — which rows were added, removed, or had labels changed.
  3. Branch experiments: “What happens to model quality if I add these 10,000 examples?” — run training on a branch, then decide whether to merge.
  4. Lineage tracking: Know which model version was trained on which dataset version.
  5. Collaborative editing: Multiple annotators or data scientists can propose dataset changes via something like a pull request, with review before merge.

Tool Fit#

Dolt is the strongest fit for structured datasets (tabular data: features, labels, metadata). The diff interface (SELECT * FROM dolt_diff_training_data) answers “what changed between versions?” directly in SQL. The branch/merge workflow maps to the experimental branch use case. DoltHub provides the collaborative pull-request workflow without requiring self-hosted infrastructure.

For the ML engineer, Dolt’s workflow maps conceptually well:

  • dolt tag v1.2 corresponds to tagging a model’s training data version.
  • dolt diff v1.1 v1.2 shows exactly what changed.
  • Training code that specifies a commit hash or tag name can reproduce exact datasets.
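The diff step can be expressed as the SQL an application would send over the MySQL protocol. The dolt_diff_<table> system table and its diff_type / from_commit / to_commit columns follow Dolt's documented diff-table shape, but verify them against your Dolt version; the training_data table and its from_label / to_label columns are hypothetical.

```python
# Sketch of "what changed between dataset versions?" against Dolt's
# per-table diff system table (columns from_<col>/to_<col> mirror the
# table's own columns; label is a hypothetical column here).

def dataset_diff_sql(table: str, from_commit: str, to_commit: str) -> str:
    """Rows added, removed, or label-changed between two commits of `table`."""
    return (
        f"SELECT diff_type, from_label, to_label FROM dolt_diff_{table} "
        f"WHERE from_commit = '{from_commit}' AND to_commit = '{to_commit}'"
    )

# Tags like v1.1 would first be resolved to commit hashes; placeholders here.
sql = dataset_diff_sql("training_data", "hash-of-v1.1", "hash-of-v1.2")
```

The returned diff_type column distinguishes added, removed, and modified rows, which answers the "which labels changed?" question directly in SQL.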

LakeFS is the fit for ML teams storing datasets as files (images, audio, large Parquet feature stores). The commit hash as a dataset identifier works the same way. DVC (Data Version Control) is a related tool in this space that integrates with git at the file level — worth comparing, though it is not a database tool.

Nessie applies specifically to teams using Iceberg as their feature store format, which is becoming more common as Iceberg matures.

Friction Points#

The ML engineer’s workflow requires understanding that Dolt version identifiers (commit hashes) need to be propagated to model training metadata. There is no out-of-the-box MLflow or Weights & Biases integration; the engineer needs to log the Dolt commit hash as a model training artifact. This is straightforward but requires discipline.

At very large scale (hundreds of millions of training examples), Dolt’s write performance may become a bottleneck. LakeFS or a combination of Parquet files + Nessie may be more appropriate.


Persona 3: Infrastructure Engineer Using Dolt for Config/State Storage#

Context#

An infrastructure engineer manages a complex system where configuration and operational state need to be versioned, auditable, and branchable. In the Gas Town infrastructure, this is precisely the Beads system: a task and issue tracking database that needs to record changes, support branches for different workers’ views of work state, and merge changes from multiple polecats back to a shared main branch.

More generally, this persona applies to:

  • Infrastructure-as-code backing stores: Configuration tables that describe what services should exist, what their parameters are. Changes go through a review branch, merge to main triggers deployment.
  • Feature flag systems: Feature flag states versioned so that you can roll back a flag change if it caused an incident.
  • CMDB (Configuration Management Database): Inventory of infrastructure that changes over time, with full audit trail.
  • Distributed agent state: Multiple agents or workers each operating on their own branch of a shared state database, periodically merging.

The Beads Case Study#

In Gas Town, the Beads system uses Dolt as its backing store specifically to enable multi-agent coordination. Each polecat (worker agent) operates in its own git worktree with its own Dolt branch of the beads database. When a polecat completes work, it merges its beads branch back to the shared state. This gives:

  • Isolation: A polecat’s in-progress work changes do not interfere with other polecats.
  • Merge coordination: The bd sync command handles merging beads changes from a worktree back to the central state.
  • Audit trail: Every beads change is a Dolt commit with a timestamp and message, giving a complete history of what tasks were created, updated, and closed.
  • Rollback: If a polecat makes incorrect beads changes, they can be rolled back without losing other state.

What This Persona Needs#

  1. SQL interface: The application connects to Dolt like a database; no custom storage API.
  2. Branch per agent/environment: Each worker or environment has its own isolated branch.
  3. Merge operations: Changes from a branch can be merged into a central branch after validation.
  4. Conflict detection: If two agents modify the same record in conflicting ways, Dolt surfaces this rather than silently overwriting.
  5. Lightweight commits: Frequent small commits (one commit per operation) are cheap.
  6. Point-in-time queries: “What was the state of the config table 30 minutes ago?” is answerable.

Tool Fit#

Dolt is the clear choice for this persona. LakeFS and Nessie are not SQL databases and cannot serve as an application’s relational backing store. Splitgraph could theoretically work but lacks Dolt’s engineering depth and reliability.

The infrastructure engineer using Dolt gets a system that behaves like MySQL from the application’s perspective, with version control available as needed. The cost is the Prolly Tree overhead, which at config/state access patterns (low volume, not performance-critical) is negligible.

Operational Considerations for This Persona#

  • Schema migrations: Run DDL changes on a branch, validate, then merge to main. This is safer than ALTER TABLE on a production database.
  • Backup strategy: Push to DoltHub or a self-hosted Dolt remote as the backup mechanism. The push operation transfers incremental changes (not full copies), making it efficient.
  • High availability: Dolt does not yet have native primary/replica clustering comparable to MySQL Group Replication. For high-availability requirements, this is a meaningful gap. For internal tooling (like Beads), single-node reliability is typically sufficient.
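The branch-validate-merge migration pattern from the first bullet can be written out as plain SQL statements. The dolt_* stored procedure names follow Dolt's documented procedures (verify flags against your version); the table, column, and branch names are made up for illustration.

```python
# Hedged sketch of a schema migration run on a Dolt branch, as the SQL
# a client would send over the MySQL protocol. Identifiers are illustrative.

migration = [
    "CALL dolt_checkout('-b', 'migrate-add-region')",      # branch from current HEAD
    "ALTER TABLE services ADD COLUMN region VARCHAR(32)",  # the DDL change, isolated
    "CALL dolt_commit('-A', '-m', 'add region column')",   # commit on the branch
    # ...run validation queries against the branch here...
    "CALL dolt_checkout('main')",
    "CALL dolt_merge('migrate-add-region')",               # promote once validated
]
```

If validation fails, the branch is simply deleted and main never saw the DDL, which is the safety property a direct ALTER TABLE on production cannot offer.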

Persona 4: DBA Wanting an Audit Trail#

Context#

A DBA at a company with compliance requirements (financial services, healthcare, legal) needs to answer questions like:

  • “Who changed this record, and when?”
  • “What did the customer account table look like on December 31st?”
  • “Show me all changes to the pricing table in Q4.”
  • “Someone made a mistake — take the inventory table back to its state from yesterday morning.”

Traditional approaches include:

  • Trigger-based audit tables: A separate audit log table captures every change. Works, but requires schema design per-table, grows indefinitely, and makes point-in-time reconstruction expensive (must replay events).
  • Database temporal tables: Some databases (SQL Server, MariaDB) support system-time versioned tables. This works for point-in-time reads but does not provide branching or merging.
  • WAL-based CDC: Change Data Capture from the write-ahead log (Debezium, etc.) provides a stream of changes but requires downstream infrastructure to store and query the history.

What They Actually Need#

  1. Row-level change history: Every modification to a row captured without manual trigger setup.
  2. Point-in-time reads: Query the database as it existed at any past moment.
  3. Author attribution: Know which user or process made each change.
  4. Efficient rollback: Undo a set of changes cleanly, not by replaying inverse operations.
  5. Minimal application changes: The audit capability should not require restructuring queries.

Tool Fit#

Dolt provides all of these natively. Every INSERT, UPDATE, and DELETE in a committed transaction appears in the Dolt commit history. The dolt_diff_<tablename> system tables expose row-level changes between any two commits. Point-in-time reads use SELECT ... AS OF 'commit-hash-or-timestamp'. Rollback is CALL dolt_reset('--hard', 'commit-hash').

The DBA does not need to add trigger logic, design audit tables, or build change event pipelines. The version control is built into the storage layer and is always on.
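The three DBA workflows above, as concrete SQL strings. The AS OF clause and dolt_reset procedure are described in the text; the accounts table, balance columns, and commit hash are hypothetical stand-ins.

```python
# Hedged sketch of the DBA's audit queries against Dolt (identifiers illustrative).

point_in_time = "SELECT * FROM accounts AS OF '2025-12-31 23:59:59'"
row_history = (
    "SELECT diff_type, from_balance, to_balance, to_commit "
    "FROM dolt_diff_accounts"  # from_<col>/to_<col> mirror the table's columns
)
rollback = "CALL dolt_reset('--hard', 'abc123')"  # 'abc123' = hypothetical commit hash
```

Each of these runs over a standard MySQL connection, with no audit tables or triggers to maintain.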

Limitations for This Persona#

Dolt is not a direct MySQL replacement at compliance scale without careful evaluation:

  • Write performance: High-volume OLTP databases (millions of transactions per day) will be significantly impacted by Dolt’s overhead. This persona should benchmark carefully before migrating production OLTP workloads.
  • Storage growth: Every committed version of data is retained. Dolt supports garbage collection (dolt gc) but the semantics differ from conventional VACUUM/PURGE operations.
  • Stored procedure compatibility: Complex stored procedures may not be fully compatible.
  • Replication: Dolt does not support MySQL binlog replication for use with existing MySQL replicas.

For compliance use cases where the database is read-heavy or the write volume is moderate, Dolt is a compelling option. For high-throughput OLTP, the DBA should use trigger-based audit logging or temporal tables and not switch databases.

LakeFS and Nessie are not relevant to this persona (not SQL databases).


Synthesis: Which Persona Maps to Which Tool#

| Persona | Primary Need | Best Tool | Alternative |
| --- | --- | --- | --- |
| Data Engineer | Reproducible pipelines, atomic writes | LakeFS (if files), Nessie (if Iceberg) | Dolt (if SQL output) |
| ML Engineer | Dataset versioning, diff, reproducibility | Dolt (tabular data), LakeFS (files) | DVC (file-based) |
| Infrastructure Engineer | Config/state store with branching | Dolt | None (LakeFS/Nessie not applicable) |
| DBA | Audit trail, point-in-time reads | Dolt | DB temporal tables (limited) |

The common thread for Dolt’s three winning personas is that the data is relational (SQL) and the access pattern is through SQL queries. LakeFS wins when the data is files and the workflow is object-storage native. Nessie wins when the team is Iceberg-native and wants catalog-level versioning.

The absence of a PostgreSQL-native option (Splitgraph’s stalled state) is a real gap. Teams that are deeply PostgreSQL-committed and cannot accept the MySQL wire protocol face a difficult choice between Dolt (accept MySQL compatibility layer) and Splitgraph (accept reduced maintenance quality).


S4: Strategic Discovery — Version-Controlled & Branchable Databases#

Topic: 1.188 — Version-Controlled & Branchable Databases Stage: S4 Strategic Discovery Date: 2026-03-04


Overview#

This stage assesses the long-term viability, ecosystem health, and strategic positioning of the version-controlled database tools. The goal is to understand which tools are safe to build on for a multi-year horizon and how the space is likely to evolve.


Ecosystem Health Assessment#

Dolt / DoltHub#

Funding & Organization: DoltHub Inc. is a venture-backed company (Series A closed 2021, total funding approximately $16M as of 2024 public disclosures). The company has not made further public funding announcements, suggesting either profitability at their scale or a focus on sustainable growth. The core team is small (10-20 people) but has maintained consistent development velocity.

Development Velocity: The Dolt GitHub repository has maintained a release cadence of approximately one release per 1-2 weeks for several years. The changelog is detailed and the team is responsive to issues. This is not a project that ships once a year.

Community Signals:

  • GitHub stars: ~18,000 (high for a database project)
  • Active Discord community with several hundred members
  • Technical blog (dolthub.com/blog) publishes regularly — benchmarks, engineering deep dives, case studies
  • DoltHub hosts hundreds of public datasets, indicating real adoption

Revenue Model: DoltHub Inc. generates revenue from DoltHub SaaS (hosted database service with paid tiers for private repos and collaboration) and from Doltgres (see below). This provides a sustainable funding path that does not depend on data volume or per-query pricing.

Doltgres: In 2023-2024, DoltHub began developing Doltgres — a PostgreSQL wire protocol-compatible version of Dolt. This directly addresses the most significant adoption barrier: teams committed to PostgreSQL can use Doltgres rather than accepting the MySQL compatibility layer. Doltgres is in active development but not yet production-ready as of early 2026. Its existence signals that DoltHub intends to capture the PostgreSQL market and is making a meaningful engineering investment to do so.

Risk Factors:

  • Small company: if DoltHub loses funding or pivots, Dolt is open-source (Apache 2.0) and could be maintained by the community, but at reduced velocity.
  • Performance gap: if MySQL or PostgreSQL add native branching (e.g., via extensions), Dolt’s primary value proposition is threatened. This seems unlikely in the near term — it would require fundamental storage engine changes — but is a non-zero long-term risk.
  • Prolly Tree as a point of concentration: the Prolly Tree storage format is Dolt’s core invention. If a fundamental limitation is discovered (correctness bug, performance cliff at scale), it would require significant re-engineering.

Strategic Verdict: Safe to build on. DoltHub has enough funding, community, and technical depth to be a reliable 3-5 year bet. The open-source license provides a safety net. The Doltgres investment shows strategic thinking beyond the MySQL niche.


LakeFS / Treeverse#

Funding & Organization: Treeverse Ltd. (the company behind LakeFS) raised a Series B of $26M in 2022, bringing total funding to approximately $40M. They are Israel-based and have grown the team significantly.

Development Velocity: LakeFS has active development. The repository shows consistent releases. The community Slack is active with thousands of members.

Community Signals:

  • GitHub stars: ~4,500 (solid for the data lake tooling space)
  • Integrations with major data platforms: Spark, Flink, dbt, Airflow, Presto, Trino
  • Listed as a recommended tool in several enterprise data architecture guides

Revenue Model: Treeverse offers a cloud-managed version (lakeFS Cloud) and an enterprise self-hosted tier. The open-source core is Apache 2.0, with enterprise features (SSO, RBAC, support) in the commercial tier. This follows the open-core model.

Open Core Concerns: The boundary between open-source and commercial features has shifted over LakeFS’s history. Some features that were initially open-source were moved to the enterprise tier. This is a common open-core evolution pattern, but it creates adoption risk: investing in LakeFS means accepting that some features you depend on may become commercial.

Competitive Landscape: LakeFS competes with:

  • Delta Sharing: Databricks-backed open protocol for data sharing. Less about versioning, more about access.
  • Iceberg REST Catalog + Nessie: For Iceberg users, this combination can replace LakeFS’s role.
  • DVC (Data Version Control): More developer-facing, file-level versioning. Different audience.

Strategic Verdict: Reasonably safe to build on for data lake use cases. The Series B funding provides runway. The primary risk is the open-core model’s feature boundary drift. Teams should evaluate whether their required features are in the open-source tier before committing.


Nessie / Dremio#

Funding & Organization: Project Nessie is Apache-licensed and primarily maintained by Dremio employees. Dremio is a well-funded data platform company ($160M raised in total, including participation from Andreessen Horowitz). Nessie is a strategic component of Dremio’s Arctic data catalog product.

Development Velocity: The Nessie repository is actively maintained, with releases tied to the Iceberg release cadence.

Community Signals:

  • GitHub stars: ~900 (lower, but expected for infrastructure tooling)
  • Deep integration in the Apache Iceberg ecosystem: Iceberg’s catalog interface explicitly supports Nessie
  • Increasingly mentioned in lakehouse architecture discussions

Apache Iceberg Momentum: The strategic context for Nessie is Apache Iceberg’s explosive growth. As Iceberg becomes the dominant open table format (competing with Delta Lake and Hudi), catalog versioning becomes a standard requirement. Nessie is well-positioned to become the default versioned catalog for Iceberg deployments.

Stewardship Risk: Nessie is primarily a Dremio project. If Dremio’s business pivots or it is acquired, Nessie’s development trajectory could change. The Apache license mitigates the worst case (project can be forked), but velocity would decrease. There has been community discussion about donating Nessie to the Apache Software Foundation, which would significantly reduce stewardship risk.

Strategic Verdict: Safe to build on within the Iceberg ecosystem. The risk is concentration in Dremio. For teams committed to Apache Iceberg, Nessie is likely to remain a viable and actively-maintained option for the foreseeable future.


Splitgraph#

Development Status: Development activity dropped sharply in 2022-2023. The core library (sgr) has open issues that remain unaddressed. The team appears to have focused on the data catalog/sharing product rather than the versioning engine.

Strategic Verdict: Not recommended for new projects. Splitgraph is in maintenance mode. Teams building on it accept the risk of maintaining the tooling themselves if they encounter issues.


Decision Matrix#

Primary Selection Criteria#

| Criterion | Dolt | LakeFS | Nessie | Splitgraph |
| --- | --- | --- | --- | --- |
| SQL / relational data | Excellent | Not applicable | Not applicable | Good |
| File / object data | Not applicable | Excellent | Not applicable | Not applicable |
| Iceberg tables | Not applicable | Good | Excellent | Not applicable |
| Branch/merge semantics | Full | Full | Full | Partial |
| Production readiness | High | High | Medium-High | Low |
| Performance overhead | Medium (2x MySQL) | Low | Very Low | Medium |
| Community health | Strong | Strong | Moderate | Weak |
| Long-term safety | High | Medium-High | Medium-High | Low |
| PostgreSQL compatibility | Partial (Doltgres in dev) | N/A | N/A | Yes (but stalled) |
| MySQL compatibility | Full | N/A | N/A | N/A |
| Self-hostable | Yes | Yes | Yes | Yes |
| SaaS option | DoltHub | lakeFS Cloud | Dremio Arctic | data.splitgraph.com |

Secondary Selection Criteria#

Team SQL familiarity: If your team thinks in SQL and operates a SQL database, Dolt is the lowest adoption friction. Git-for-files approaches (LakeFS) require thinking about your data as a file namespace, which may be unfamiliar.

Existing infrastructure: If you already operate Spark on S3, LakeFS slots in with minimal disruption. If you already run Iceberg with Hive Metastore, replacing the metastore with Nessie is bounded. If you run MySQL, Dolt is a potential drop-in.

Compliance requirements: Dolt provides the strongest story for SQL-level audit requirements. Financial or healthcare data stored in SQL tables can use Dolt’s built-in history as the audit log, potentially replacing bespoke audit trigger infrastructure.

Collaboration workflows: DoltHub’s pull request model for data is the most mature collaborative data workflow in this space. Teams that want human review of data changes before merge will find DoltHub compelling.


The Lakehouse Convergence#

The data industry is converging on the lakehouse architecture: open table formats (Iceberg, Delta Lake) stored in object storage, queried by compute engines (Spark, Trino, Dremio). This architecture puts Nessie in an increasingly strategic position, as catalog versioning becomes a standard requirement for production lakehouses.

Prediction: By 2027, a versioned Iceberg catalog (Nessie or equivalent) will be considered a standard component of enterprise data infrastructure, in the same way that git is a standard component of software development infrastructure.

Dolt’s Expanding Market#

Dolt’s MySQL compatibility has been both its strength (easy adoption) and its limitation (misses the PostgreSQL market). Doltgres is the company’s bet on expanding beyond MySQL. If Doltgres achieves production-readiness, Dolt’s addressable market roughly doubles.

Additionally, as LLM-based applications generate and manipulate structured data at scale, the need for version-controlled databases could grow significantly. An LLM agent that modifies a database table needs the same isolation/merge semantics that a human data engineer does — arguably more so, given the velocity of LLM-driven modifications.

Git-for-Data as a Standard Pattern#

The conceptual pattern — apply git semantics to data — is becoming increasingly well-understood. As more teams adopt this pattern, tooling will mature and interoperability will improve. The current fragmentation (different tools for different data formats) may consolidate around a smaller set of standards.

The Apache Iceberg ecosystem’s growth may produce a gravity well that pulls LakeFS and Nessie into tighter integration, or into competition. Either outcome accelerates the maturation of the space.


Recommendation Summary#

For new projects in 2026:

  1. SQL/relational data with branching needs → Dolt. No serious alternatives at this tier.
  2. Data lake pipelines on object storage → LakeFS. Evaluate enterprise needs against open-core risk.
  3. Apache Iceberg deployments → Nessie. Consider pairing with LakeFS for file-level versioning.
  4. PostgreSQL-native requirements → Wait for Doltgres, or evaluate Splitgraph with the understanding that you may need to maintain it.
  5. Config/state stores with audit requirements → Dolt. Especially compelling for internal tooling at moderate write volumes.

For Gas Town infrastructure specifically: Dolt (via Beads) is the correct choice and is well-validated. The tool is appropriately matched to the use case: low-volume, SQL-queryable, multi-branch agent coordination. No migration is warranted. Monitor Doltgres development if the infrastructure ever needs PostgreSQL compatibility.

Published: 2026-03-04 Updated: 2026-03-04