Turn weeks of data infrastructure into a few lines of Python
Build robust AI and data applications over S3 with Serverless functions and Iceberg tables. No Kubernetes, no Spark, no infrastructure to manage.
Branch
Import
Run
Merge
import bauplan

client = bauplan.Client()

# Create a new branch
branch = client.create_branch(
    branch="dev_branch",
    from_ref="main"
)
print(f'Created branch "{branch.name}"')

# List tables in the branch
for table in client.get_tables(ref=branch):
    print(f"{table.namespace:<12} {table.name:<12} {table.kind}")
Create sandboxes instantly without data duplication
Build an Iceberg Lakehouse in pure Python
Implement a Data Lakehouse architecture over object storage with a Write-Audit-Publish pattern for safe data ingestion.
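As a rough sketch, here is how the Write-Audit-Publish steps can map onto the client calls shown on this page; the branch, table, and S3 path names are placeholders.

import bauplan

client = bauplan.Client()

# WRITE: land the raw data on a temporary branch, never on main
ingest_branch = client.create_branch(branch="wap_ingest", from_ref="main")
client.create_table("raw_orders", "s3://my-bucket/orders/*.parquet", ingest_branch)
client.import_data("raw_orders", "s3://my-bucket/orders/*.parquet", ingest_branch)

# AUDIT: run the pipeline (transformations plus quality checks) on the branch only
client.run("./lakehouse_pipeline", ingest_branch)

# PUBLISH: only audited tables reach the production data lake
client.merge_branch(ingest_branch, into_branch="main")

Because the branch is zero-copy, the audit step runs against production-scale data without duplicating it, and a failed check simply leaves main untouched.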
End-to-end Machine Learning Pipeline
A streamlined pipeline transforming raw data into a predictions table: prepare a training set and train a Linear Regression model.
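A minimal sketch of the training step as a Bauplan model; the upstream table name, the feature and target columns, and the scikit-learn version are illustrative assumptions, not the exact code of the example.

import bauplan

@bauplan.model()
@bauplan.python(pip={'pandas': '2.2.0', 'scikit-learn': '1.4.0'})
def trip_predictions(
    # hypothetical upstream model holding the prepared training set
    training_set=bauplan.Model('training_set'),
):
    from sklearn.linear_model import LinearRegression

    # inputs arrive as Arrow tables; convert to pandas for scikit-learn
    df = training_set.to_pandas()
    features, target = df[['distance_km', 'passenger_count']], df['trip_duration']

    # fit a Linear Regression model and materialize its predictions as a table
    predictions = df.copy()
    predictions['predicted_duration'] = LinearRegression().fit(features, target).predict(features)
    return predictions

Running the pipeline on a development branch first materializes the predictions table there, so it can be inspected before being merged into main.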
Data Augmentation with LLMs
Entity matching across different datasets, using OpenAI APIs. This example uses e-commerce product catalogs.
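A hedged sketch of the idea; the table names, the 'title' column, the prompt, and the OpenAI model choice are illustrative, and an OPENAI_API_KEY is assumed to be set in the environment.

import bauplan

@bauplan.model()
@bauplan.python(pip={'openai': '1.30.0', 'pandas': '2.2.0'})
def matched_products(
    # hypothetical tables holding the two e-commerce catalogs to reconcile
    catalog_a=bauplan.Model('catalog_a'),
    catalog_b=bauplan.Model('catalog_b'),
):
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment

    llm = OpenAI()
    df_a, df_b = catalog_a.to_pandas(), catalog_b.to_pandas()

    def same_product(title_a: str, title_b: str) -> bool:
        reply = llm.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{
                'role': 'user',
                'content': (
                    'Do these two titles describe the same product? Answer YES or NO.\n'
                    f'A: {title_a}\nB: {title_b}'
                ),
            }],
        )
        return reply.choices[0].message.content.strip().upper().startswith('YES')

    # naive row-by-row comparison over the (placeholder) 'title' columns
    df_a['is_match'] = [
        same_product(a, b) for a, b in zip(df_a['title'], df_b['title'])
    ]
    return df_a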
Read more
Built with Bauplan
Forecasting ML Pipeline
A streamlined pipeline transforming raw data into a predictions table: prepare a training set and train a Linear Regression model.
Data Augmentation with OpenAI
Entity matching across different e-commerce product catalogs, leveraging off-the-shelf LLM APIs from OpenAI. The entire project runs on object storage (S3) in open formats (Iceberg), relying solely on vanilla Python to orchestrate the DAG and integrate AI services.
Build a Lakehouse with Apache Iceberg in pure Python
Data Lakehouse architecture over object storage with Iceberg tables and robust Write-Audit-Publish workflows for safe data ingestion. Build a Lakehouse in ~150 lines of Python without needing a Data Warehouse, JVM, or Iceberg expertise.
Data quality and expectations
Define data quality constraints with expectations to enforce standards and monitor pipeline updates using blazing fast vectorized tests.
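A minimal sketch of one such check; the @bauplan.expectation decorator and the table and column names are assumptions based on this description, and the check itself is a plain vectorized operation on the Arrow table.

import bauplan

# the expectation decorator and the table/column names below are assumptions
@bauplan.expectation()
@bauplan.python(pip={'pyarrow': '15.0.0'})
def expect_no_null_order_ids(
    data=bauplan.Model('orders'),
):
    # vectorized null check: the run fails if any order_id is missing
    return data.column('order_id').null_count == 0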
Interactive visualization with Streamlit
Build a data transformation pipeline and visualize results with Streamlit using SQL querying and branching.
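A small sketch of what that can look like; the client.query call and its parameters, along with the table and column names, are assumptions for illustration.

import bauplan
import streamlit as st

client = bauplan.Client()

st.title('Revenue by day')

# let the viewer switch between the production lake and a development branch
branch = st.selectbox('Branch', ['main', 'dev_branch'])

# run SQL against the chosen branch and get the result back as an Arrow table
table = client.query(
    'SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date',
    ref=branch,
)

st.bar_chart(table.to_pandas().set_index('order_date'))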
Near Real-Time Analytics
Full stack real-time data pipeline for e-commerce analytics. This project features data ingestion, transformations, and live dashboards for key metrics like revenue and engagement, all managed with branch-based workflows and minimal setup.
The optimal workflow for data teams
Branch
Instant Zero-Copy Environments
Quickly spin up development and testing environments without duplicating data.
Version Control for Data
Work with data the way you work with code. Use familiar operations like branching, checkout, and merging.
Safe and Sandboxed Experiments
Keep your production environment safe. Collaborate in fully isolated, sandboxed environments.
import bauplan

client = bauplan.Client()

# Create a new branch
branch = client.create_branch(
    branch="dev_branch",
    from_ref="main"
)
print(f'Created branch "{branch.name}"')

# List tables in the branch
for table in client.get_tables(ref=branch):
    print(f"{table.namespace:<12} {table.name:<30} {table.kind}")
Develop
No Infrastructure to Manage
Define environments entirely in code — never worry about containers and environment management.
Pure Python
Build and test data applications directly in your IDE — no need to learn new frameworks, just code as you normally would.
Serverless Functions
Execute your workloads seamlessly in the cloud, combining serverless functions into pipelines.
import bauplan

@bauplan.model()
# Specify the Python environment with exact package versions
@bauplan.python(pip={'pandas': '2.2.0'})
def clean_data(
    # Input model reference - points to an existing table or model
    data=bauplan.Model('my_data')
):
    import pandas as pd
    # Your data transformation logic here
    cleaned = ...
    return cleaned
Automate
Merge your changes
Deploy by merging new tables into your main data lake. Use our Python SDK, automate your CI/CD pipelines and deployment.
Built-In testing
Incorporate unit tests and expectations directly into your workflows, ensuring your data is always reliable and consistent.
Effortless Integration
Connect to visualization tools and orchestrators with just one line of code.
import bauplan

client = bauplan.Client()

# create a zero-copy branch of your data lake
client.create_branch(dev_branch, from_ref='main')

# create an Iceberg table and import data in it
client.create_table(table_name, data_source, dev_branch)
client.import_data(table_name, data_source, dev_branch)

# run a pipeline end-to-end in a branch
client.run('./my_project_dir', dev_branch)

# merge the new tables into the main data lake
client.merge_branch(dev_branch, into_branch='main')

print('So Long, and Thanks for All the Fish')
Bauplan: Zero-copy, Scale-up FaaS for Data Pipelines
Paper presented at WoSC10 2024. In collaboration with The University of Wisconsin.
by J. Tagliabue, T. Caraza-Harter and C. Greco
Building, shipping, and running containers is too slow for Python.
Making the experience of running data workflows in the cloud indistinguishable from doing it locally.
by N. LeClaire and C. Greco
To serverless or not to serverless
Find the right balance between cost control and fast startup time for your Spark clusters.
by C. Greco