Turn weeks of data infrastructure into
a few lines of Python
Build robust AI and data applications over S3 with serverless functions and Iceberg tables. No Kubernetes, no Spark, no infrastructure to manage.
Branch
Import
Run
Merge
import bauplan

client = bauplan.Client()

# Create a new branch
branch = client.create_branch(
    branch="dev_branch",
    from_ref="main"
)
print(f'Created branch "{branch.name}"')

# List tables in the branch
for table in client.get_tables(ref=branch):
    print(f"{table.namespace:<12} {table.name:<12} {table.kind}")
Create sandboxes instantly without data duplication
Zero-copy data lake branches
Built with Bauplan
From unstructured to structured data with LLMs
Transform PDFs into structured tables using a Bauplan LLM pipeline, ensuring versioning, safe experimentation, and seamless analysis.
End-to-End RecSys with MongoDB
Build a full-stack music recommender system using Bauplan for data processing and MongoDB for serving. Train embeddings on Spotify playlists, store them in Iceberg and MongoDB, and explore recommendations via a Streamlit app.
Build a Lakehouse with Apache Iceberg in pure Python
Data Lakehouse architecture over object storage with Iceberg tables and robust Write-Audit-Publish workflows for safe data ingestion. Build a Lakehouse in ~150 lines of Python without needing a Data Warehouse, JVM, or Iceberg expertise.
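As a sketch, the Write-Audit-Publish flow maps directly onto the branch, run, and merge calls shown elsewhere on this page; the branch name, table name, and S3 path below are illustrative placeholders:

import bauplan

client = bauplan.Client()

# WRITE: ingest into an isolated branch, never directly into main
ingest_branch = 'wap_ingest'  # illustrative branch name
client.create_branch(ingest_branch, from_ref='main')
client.create_table('raw_events', 's3://my-bucket/events/*.parquet', ingest_branch)
client.import_data('raw_events', 's3://my-bucket/events/*.parquet', ingest_branch)

# AUDIT: run the pipeline (and its data quality tests) against the branch;
# if anything fails, the run stops and main is untouched
client.run('./my_project_dir', ingest_branch)

# PUBLISH: merge the audited tables into the main data lake
client.merge_branch(ingest_branch, into_branch='main')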
Data quality and expectations
Define data quality constraints with expectations to enforce standards and monitor pipeline updates using blazing fast vectorized tests.
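A minimal sketch of what such a test can look like, assuming the @bauplan.expectation() decorator and the standard expect_column_no_nulls helper; the model and column names are illustrative:

import bauplan
# assumed helper from the library's standard, vectorized expectations
from bauplan.standard_expectations import expect_column_no_nulls

@bauplan.expectation()
@bauplan.python('3.11')
def test_no_null_ids(
    # illustrative model name
    data=bauplan.Model('clean_data')
):
    # fail the run if any user_id is null
    return expect_column_no_nulls(data, 'user_id')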
Forecasting ML Pipeline
A streamlined pipeline transforming raw data into a predictions table: prepare a training set and train a Linear Regression model.
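A sketch of the training step as a single Bauplan model, assuming scikit-learn for the regression; the model, feature, and target names are illustrative:

import bauplan

@bauplan.model()
@bauplan.python(pip={'pandas': '2.2.0', 'scikit-learn': '1.4.0'})
def predictions(
    # illustrative upstream model holding the training set
    data=bauplan.Model('training_set')
):
    from sklearn.linear_model import LinearRegression
    # assuming inputs arrive as Arrow tables, convert to pandas
    df = data.to_pandas()
    # illustrative feature and target columns
    X, y = df[['feature_1', 'feature_2']], df['target']
    reg = LinearRegression().fit(X, y)
    df['prediction'] = reg.predict(X)
    return df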
Data Augmentation with OpenAI
Entity matching across different e-commerce product catalogs, leveraging an off-the-shelf LLM API from OpenAI. The entire project runs on object storage (S3) in open formats (Iceberg), relying solely on vanilla Python to orchestrate the DAG and integrate AI services.
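A sketch of how such a model can call an LLM in-line, assuming the official openai client and an OPENAI_API_KEY in the environment; the catalog models, columns, and naive row pairing are illustrative:

import bauplan

@bauplan.model()
@bauplan.python(pip={'openai': '1.30.0', 'pandas': '2.2.0'})
def matched_entities(
    # illustrative inputs: two product catalogs to reconcile
    catalog_a=bauplan.Model('catalog_a'),
    catalog_b=bauplan.Model('catalog_b')
):
    from openai import OpenAI
    llm = OpenAI()  # reads OPENAI_API_KEY from the environment
    a, b = catalog_a.to_pandas(), catalog_b.to_pandas()
    matches = []
    for title_a, title_b in zip(a['title'], b['title']):  # naive, illustrative pairing
        resp = llm.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{
                'role': 'user',
                'content': (
                    'Do these two product titles refer to the same item? '
                    f'Answer yes or no.\nA: {title_a}\nB: {title_b}'
                )
            }]
        )
        matches.append(resp.choices[0].message.content.strip().lower().startswith('yes'))
    a['is_match'] = matches
    return a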
The optimal workflow for data teams
Branch
Instant Zero-Copy Environments
Quickly spin up development and testing environments without duplicating data.
Version Control for Data
Work with data the way you work with code. Use familiar operations like branching, checkout, and merging.
Safe and Sandboxed Experiments
Keep your production environment safe. Collaborate in fully isolated, sandboxed environments.
import bauplan

client = bauplan.Client()

# Create a new branch
branch = client.create_branch(
    branch="dev_branch",
    from_ref="main"
)
print(f'Created branch "{branch.name}"')

# List tables in the branch
for table in client.get_tables(ref=branch):
    print(f"{table.namespace:<12} {table.name:<30} {table.kind}")
Develop
No Infrastructure to Manage
Define environments entirely in code — never worry about containers and environment management.
Pure Python
Build and test data applications directly in your IDE — no need to learn new frameworks, just code as you normally would.
Serverless Functions
Execute your workloads seamlessly in the cloud, combining serverless functions into pipelines.
import bauplan

@bauplan.model()
# Specify the Python environment with exact package versions
@bauplan.python(pip={'pandas': '2.2.0'})
def clean_data(
    # Input model reference - points to an existing table
    data=bauplan.Model('my_data')
):
    import pandas as pd
    # Your data transformation logic here
    ...
    # Return the transformed table
    return data
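Pipelines are just functions feeding functions: a model can declare another model's output as its input, and Bauplan wires the DAG. A minimal sketch reusing clean_data from the snippet above; top_products, the grouping column, and the assumption that inputs arrive as Arrow tables are illustrative:

import bauplan

@bauplan.model()
@bauplan.python(pip={'pandas': '2.2.0'})
def top_products(
    # downstream model: consumes the output of clean_data above
    data=bauplan.Model('clean_data')
):
    import pandas as pd
    # assuming inputs arrive as Arrow tables, convert to pandas
    df = data.to_pandas()
    # illustrative aggregation
    return df.groupby('product_id').size().reset_index(name='orders')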
Automate
Merge your changes
Deploy by merging new tables into your main data lake. Use our Python SDK, automate your CI/CD pipelines and deployment.
Built-in testing
Incorporate unit tests and expectations directly into your workflows, ensuring your data is always reliable and consistent.
Effortless Integration
Connect to visualization tools and orchestrators with just one line of code.
import bauplan

client = bauplan.Client()

# illustrative names for this snippet
dev_branch = 'dev_branch'
table_name = 'my_table'
data_source = 's3://my-bucket/data/*.parquet'

# create a zero-copy branch of your data lake
client.create_branch(dev_branch, from_ref='main')

# create an Iceberg table and import data into it
client.create_table(table_name, data_source, dev_branch)
client.import_data(table_name, data_source, dev_branch)

# run a pipeline end-to-end in a branch
client.run('./my_project_dir', dev_branch)

# merge the new tables into the main data lake
client.merge_branch(dev_branch, into_branch='main')

print('So Long, and Thanks for All the Fish')
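Once a branch is merged, downstream tools can read results straight from the lake; a sketch assuming the SDK exposes a query method returning an Arrow table (the SQL and table name are illustrative):

import bauplan

client = bauplan.Client()

# run an interactive query against main and hand the result
# to any tool that speaks pandas or Arrow
df = client.query(
    'SELECT * FROM top_products LIMIT 100',
    ref='main'
).to_pandas()
print(df.head())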

Bauplan: Zero-copy, Scale-up FaaS for Data Pipelines
Paper presented at WoSC10 2024. In collaboration with The University of Wisconsin.
by J. Tagliabue, T. Caraza-Harter and C. Greco



Building, shipping, and running containers is too slow for Python
Making the experience of running data workflows in the cloud indistinguishable from running them locally.
by N. LeClaire and C. Greco



Blending DuckDB and Iceberg for Optimal Cloud OLAP
Lessons learned crafting a Serverless Lakehouse from spare parts
by N. LeClaire

Try bauplan for free
Create your sandbox and start building.
