At a data meetup I attended recently, I overheard two people having a more-animated-than-expected conversation about data version control. Each had strong feelings about how best to track and manage changes to data files over time.

One point they both agreed upon is that Git is a very useful tool, but maybe not optimal for data version control. If you’re not familiar, Git is a free and open-source distributed version control system (DVCS) designed to manage changes to files and projects over time.

The Shortcomings of Git for Data Version Control.

If you’ve ever tried using Git to manage data sets, you’ve probably hit a wall fast. Git was built for code: small, text-based files that change in predictable ways. Data is a different beast: large, often binary, and constantly evolving. When you try to push a few gigabytes of CSVs or Parquet files, Git grinds to a halt. Storage balloons, merges become painful, and collaboration turns into chaos.

The issue comes down to how Git stores files. Git doesn’t actually save diffs; every version of a file is stored as a complete snapshot (a blob), and the line-by-line diffs you see are computed on demand. That works beautifully for scripts and configs, because Git compresses similar text versions very efficiently behind the scenes, but large binary files defeat that compression. Each commit of a dataset effectively adds a full copy to the repository, and every clone drags the entire history along. Multiply that across teams and experiments, and your repo quickly becomes unusable.
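
To see why, here’s a small Python sketch of Git’s storage model. The git_blob_id function mirrors how Git computes a blob’s object ID (the SHA-1 of a short header plus the full file content), so changing a single byte of a large file yields a brand-new object rather than a delta:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # A Git blob's ID is the SHA-1 of "blob <size>\0" + content,
        # i.e. the object is addressed by its *entire* contents.
        header = f"blob {len(content)}\0".encode()
        return hashlib.sha1(header + content).hexdigest()

    v1 = b"x" * 1_000_000        # stand-in for a large data file
    v2 = v1[:-1] + b"y"          # the same file with one byte changed

    # Two different IDs: Git must store two full objects, not a diff.
    print(git_blob_id(v1))
    print(git_blob_id(v2))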

Data Version Control Tools to Consider.

That’s why data versioning tools were built: to extend Git’s philosophy to data-scale problems. Let’s look at three standout options, each with a brief usage sketch after the list:

  • DVC (Data Version Control) integrates directly with Git, storing data in remote backends like S3 or GCS while keeping lightweight references in your repo. It’s ideal for ML pipelines where data sets and models evolve alongside code.
  • LakeFS takes a different approach, bringing Git-like branching and commits to object storage. You can create isolated “branches” of your data lake to test changes safely, then merge them back when validated.
  • Dolt treats data as tables in a version-controlled database. Every change is tracked, diffed, and queryable with SQL — a game changer for teams collaborating on structured data.
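
Here’s a minimal sketch of the DVC pattern described above: code lives in Git, data lives in a remote, and a pinned revision ties them together. The dvc.api call is real, but the repo URL, file path, and tag are hypothetical placeholders:

    import dvc.api

    # Stream a specific version of a DVC-tracked file straight from
    # remote storage, pinned to a Git revision (tag, branch, or commit).
    # The repo URL, path, and rev are made-up examples.
    with dvc.api.open(
        "data/train.csv",
        repo="https://github.com/example/my-ml-project",
        rev="v1.2",
    ) as f:
        print(f.readline())  # peek at the header row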
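
lakeFS exposes an S3-compatible endpoint, so one common pattern is to point a standard S3 client at it and address objects as repository/branch/key. A sketch with boto3, where the endpoint, credentials, repository, and branch names are all assumptions:

    import boto3

    # With lakeFS's S3 gateway, the "bucket" is the repository and the
    # object key is prefixed with the branch. All values are hypothetical.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://lakefs.example.com",
        aws_access_key_id="LAKEFS_KEY_ID",
        aws_secret_access_key="LAKEFS_SECRET",
    )

    # Read from an isolated experiment branch instead of main.
    obj = s3.get_object(
        Bucket="my-data-lake",                      # repository
        Key="cleanup-test/events/2024-01.parquet",  # branch/path
    )
    print(obj["ContentLength"])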
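
And since Dolt runs as a MySQL-compatible database, any standard client can talk to it. A sketch with pymysql; the connection details and the orders table are invented, but dolt_log and the DOLT_COMMIT procedure are part of Dolt’s SQL interface:

    import pymysql

    # Connection details and the orders table are hypothetical.
    conn = pymysql.connect(
        host="127.0.0.1", user="root", database="sales", autocommit=True
    )
    with conn.cursor() as cur:
        # Change data like any SQL table...
        cur.execute("UPDATE orders SET status = 'shipped' WHERE id = 42")
        # ...then record it in Dolt's version history with a message.
        cur.execute("CALL DOLT_COMMIT('-a', '-m', 'Mark order 42 shipped')")
        # History is queryable like any other table.
        cur.execute("SELECT commit_hash, message FROM dolt_log LIMIT 5")
        for commit_hash, message in cur.fetchall():
            print(commit_hash, message)
    conn.close()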

These tools don’t replace Git; they complement it. Git remains the control center for your logic and configurations, while specialized tools handle the heavy lifting of dataset storage, lineage, and reproducibility.

If your organization is scaling data operations, treating data like code isn’t just a slogan — it’s a necessity. But Git alone wasn’t designed for terabytes of data or the workflows that come with it. Pairing it with the right data versioning system gives you the speed, traceability, and collaboration you actually need.

So keep Git where it shines, and let purpose-built tools manage your data. Your future self, and your storage bill, will thank you.

The Takeaway.

Git is a very useful tool, but perhaps not the best for tracking and managing changes to large data sets.

What about you? How do you track and manage changes in your data? Are there any tools or tips you’d recommend? Please comment – I’d love to hear your thoughts.

Also, please connect with DIH on LinkedIn.

Thanks,
Tom Myers