
Apr 16, 2025

Written by Ciro Greco

Hello Bauplan

Bauplan is a serverless data platform that treats pipelines, models, and tables like software — versioned, testable, and ready for agents.

TL;DR

AI has changed what it means to work with data. Developers aren't just analyzing it; they're building with it: copilots, agents, pipelines, production ML.

But the tooling is from a previous era. Today, shipping data pipelines means stitching together notebooks, orchestrators, runtimes, Spark, SQL engines — each with its own interface, runtime, and abstractions.

That model worked when the goal was to analyze. But now we need to automate — continuously, reliably, and at scale. As data pipelines start to behave like software, they need to be built and operated like software too.

We built Bauplan to bring software engineering principles to data. Bauplan treats data pipelines, tables, and models like software: versioned, composable, and fully scriptable. 

We aim to unify data exploration, development, and production into a single code-based platform. We want to help organizations get rid of infrastructure overhead and move faster while improving robustness. We're building for a world where data is versioned, testable, and deployable, just like code. Designed for developers, ready for AI agents.

Dawn of the new everything

We spent the last ten years building data and AI systems at scale. When we started, data was a fairly esoteric specialty. Today, many of the most business-critical applications out there, like search and recommendation systems, are data applications. Tomorrow, the majority of software developers will deal with data pipelines and AI models in every application they build.

Two big things happened:

  • First, AI. Developers are no longer just building dashboards; they're building copilots, RAG pipelines, and autonomous agents. These are end-to-end software applications that need fresh data, scalable storage, fine-tuning, observability, orchestration, and CI/CD.

  • Second, object storage. Open formats like Apache Iceberg are becoming the standard for working with large-scale data on data lakes, bringing ACID transactions, schema evolution, and time travel. 
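
To make the second point concrete, here's a minimal sketch of what open formats buy you, using the open-source pyiceberg library to time-travel over an Iceberg table. The catalog name and table identifier are made up for illustration:

```python
from pyiceberg.catalog import load_catalog

# Load a catalog and an Iceberg table (names are illustrative).
catalog = load_catalog("default")
table = catalog.load_table("analytics.orders")

# Every commit to the table is preserved as a snapshot.
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms)

# Time travel: read the table exactly as it was at an earlier snapshot.
first_snapshot = table.snapshots()[0].snapshot_id
old_data = table.scan(snapshot_id=first_snapshot).to_arrow()
```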

Data tooling still does not reflect that. On the one hand, data platforms are built on monolithic warehouses, Spark clusters, notebooks, and GUI-based user experiences. On the other hand, developer tools don't make for good data platforms: shipping simple pipelines takes too many tools, feedback loops in the cloud are very slow, and too much infrastructure makes dev and prod too different from one another.

Functions and object storage are all you need

Bauplan is the simplest data platform we could think of. We adopted a pretty simple design principle: the likelihood of adopting an abstraction is inversely proportional to the time it takes a CS grad to grasp it.

It allows you to build your data platform on object storage and gives you a set of primitives in pure Python to build versioned, production-grade pipelines with no infrastructure overhead.

  • You write pipelines as serverless functions. They run in a fully managed runtime, like Lambda, but optimized for large-scale data processing and with native abstractions for tabular formats (I/O, caching, projections, filter pushdowns).

  • You build on object storage. Your Parquet and CSV files are stored in S3 as Iceberg tables, enabling schema evolution, partitioning, and time travel.

  • Every run, every artifact, and every model is tracked through a Git-style commit log — so you get reproducibility, auditability, and CI/CD for data, out of the box.

  • Everything is code. The platform offers a consistent developer interface through a CLI and a Python SDK, allowing for easy scripting and automation, as the sketch below shows.
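
Here is roughly what a two-step pipeline looks like, in the decorator style of Bauplan's Python SDK. The table and column names are made up, and the exact decorator and parameter names should be read as illustrative rather than as a definitive API reference:

```python
import bauplan

@bauplan.model()
@bauplan.python("3.11", pip={"pandas": "2.2.0"})
def clean_trips(data=bauplan.Model("taxi_trips")):
    # Runs as a serverless function: 'data' arrives as an in-memory table
    # read from an Iceberg table on S3, with column projections and filter
    # pushdowns handled by the runtime.
    df = data.to_pandas()
    return df[df["trip_miles"] > 0.0]

@bauplan.model()
def trips_by_zone(data=bauplan.Model("clean_trips")):
    # A downstream function: the DAG is inferred from function inputs, and
    # the returned table is written back as a versioned Iceberg table.
    return data.to_pandas().groupby("pickup_zone").size().reset_index(name="trips")
```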

Everything is code

Bauplan is a data platform built like software. You define pipelines, tables, metadata, and environments the same way you define logic in an application — with functions, packages, commits, and branches.

This unlocks several things that become essential as systems grow:

  • Reproducibility: Every output is tied to a commit. You can trace, rerun, and revert everything.

  • Branching: You can isolate changes to a dataset or a pipeline, then merge them safely, just like Git (see the sketch after this list).

  • Composition: Pipelines are functions. They can call each other, be packaged, tested, templated.

  • Automation: Everything is scriptable. Nothing depends on a GUI or a YAML spec locked to a single runtime.
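
For example, a write-audit-publish flow becomes a short script. This is a minimal sketch assuming a client API along the lines of Bauplan's Python SDK; method names and parameters are illustrative, not a definitive reference:

```python
import bauplan

client = bauplan.Client()

# Branch the data lake: an isolated, zero-copy view, like a Git branch.
branch = "ciro.ingest_2025_04_16"
client.create_branch(branch, from_ref="main")

# Run the pipeline against the branch; production tables are untouched.
client.run(project_dir="./pipeline", ref=branch)

# Audit: validate the output on the branch before publishing.
check = client.query(
    "SELECT COUNT(*) AS bad_rows FROM trips_by_zone WHERE trips < 0",
    ref=branch,
).to_pandas()
assert check["bad_rows"][0] == 0, "audit failed: negative trip counts"

# Publish: merge the branch back into main atomically.
client.merge_branch(source_ref=branch, into_branch="main")
```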

When velocity increases, code always wins. We saw it in DevOps: bash commands became Terraform, SFTP became CI/CD, monitoring dashboards became config-defined alerts. We will see it again in data and AI. Especially now, because code doesn't just work for humans. It works for machines.

As agents and LLMs move from novelty to production tooling, they’ll need to build and operate data systems themselves — generating queries, assembling pipelines, refining models. That only works if the platform is fully code-native.

One of our users recently built a threat detection system where a generative agent calls Bauplan’s APIs to generate, validate, and execute analytical queries. No humans in the loop. No GUI required. Just code.
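
We can't share their code, but the shape of such an agent is easy to sketch. The LLM call below is a hardcoded stand-in, and the client method is the same illustrative one used in the sketch above:

```python
import bauplan

client = bauplan.Client()

def ask_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; hardcoded so the sketch is self-contained.
    return (
        "SELECT ip, COUNT(*) AS failures FROM auth_events "
        "WHERE success = false GROUP BY ip ORDER BY failures DESC LIMIT 20"
    )

question = "Which IPs show an unusual spike in failed logins today?"
prompt = f"Write a SQL query over the 'auth_events' table to answer: {question}"

# Generate, validate, execute: if a query fails, the error is appended to
# the prompt and the model gets another try. No human in the loop, no GUI.
result = None
for _ in range(3):
    sql = ask_llm(prompt)
    try:
        result = client.query(sql, ref="main").to_pandas()
        break
    except Exception as err:
        prompt += f"\nThe previous query failed with: {err}. Fix it and try again."
```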

In a world where machines start acting like developers, code-first isn’t a nice-to-have. It’s the interface that makes it possible.

You Can’t Scale Complexity

Working with data boils down to three scenarios: interactive exploration, pipeline and model development, and orchestration and optimization at scale. Today's data platforms split these three scenarios into different runtimes, developer interfaces, and abstractions. In large companies, the scenarios often even belong to three different teams.


|  | Interactive Explorations | Pipeline & Model Development | Orchestration & Optimization |
| --- | --- | --- | --- |
| What You Do | Explore data, build dashboards | Build data pipelines, train/test models | Run pipelines and models reliably, at scale, on a schedule |
| Infrastructure | Data warehouse query engines | Python, Ray, single-node Spark | Kubernetes, Spark clusters, Ray clusters |
| Developer Interfaces | SQL editors, JDBC drivers | Notebooks, interactive Spark sessions | VS Code, Spark Submit API |
| Abstractions | Tables and views | Dataframes, DAGs, models | Orchestration tasks, schedules, triggers, retries, resource configs |

Complexity is very expensive. It slows down development, it creates friction between teams, it leads to fragile pipelines, redundant tooling, and environments that drift out of sync. And it makes it harder to apply even basic software engineering principles — testing, versioning, rollback — to data work.

In a world where data is part of everyday software development, we need to be simpler. Ideally, we want one workflow with the same developer interface, at least for most use cases, built on abstractions all developers already share: functions, packages, tables, commits, and branches.


|  | Interactive Explorations | Pipeline & Model Development | Orchestration & Optimization |
| --- | --- | --- | --- |
| What You Do | Explore data, build dashboards | Build data pipelines, train/test models | Run pipelines and models reliably, at scale, on a schedule |
| Infrastructure | Serverless Functions | Serverless Functions | Serverless Functions |
| Developer Interface | SQL Editors & IDE | SQL Editors & IDE | SQL Editors & IDE |
| Abstractions | Functions, Tables, Git | Functions, Tables, Git | Functions, Tables, Git |

Where we're going

Bauplan already powers many kinds of data workflows: data products, write-audit-publish (WAP) patterns, ML models for forecasting, data apps, recommender systems, near real-time analytics, RAG pipelines, data enrichment, agentic applications, and more.

As MCP servers for lakehouses start emerging in our community, we get a glimpse of the future: data platforms are not just for humans; they are a playground for agents to investigate data on our behalf.

Bauplan is built for both humans and machines. Typed APIs, branching, and composability were never considered as important for data as they are for software, in part because those abstractions were seen as too hard for beginners to master.

AI agents are just the opposite: they are great with code, types, and functions, not with UIs and notebooks.

We want to give developers a way to build data and AI systems with the same ergonomics and control they expect from software engineering. That means simpler primitives, tighter feedback loops, and a platform that works with the grain of modern development, not against it.

Developers shouldn't need five tools to move one dataset, or rely on notebooks and DAGs to build production systems.

Spark developers spend most of their time debugging, not scaling. Airflow users struggle to define and operate DAGs across fragmented environments. The complexity tax is real — and unsustainable.

That’s why we’re excited to share that Bauplan has raised a $7.5M seed round, led by Innovation Endeavors, with participation from some of the most respected minds in AI and infrastructure: Wes McKinney (creator of Pandas and Apache Arrow), Aditya Agarwal (ex-Dropbox CTO), Chris Ré (Stanford, TogetherAI), Ihab Ilyas (Tamr, University of Waterloo), Jeffrey J. Rothschild, and Spencer Kimball (CockroachDB).

Today, even in private beta, Bauplan runs over 40,000 jobs a week for early clients across AI, media, and B2B SaaS. There’s a lot more to come, from deeper LLM support, to vector-first compute, to collaborative tooling around versioning and branching. But the core idea stays simple: make data programmable like software. If that resonates with the systems you're building, we'd love to hear from you.

Love Python and Go development, serverless runtimes, data lakes and Apache Iceberg, and superb DevEx? We do too! Subscribe to our newsletter.

Try bauplan