Turn weeks of data infrastructure into a few lines of Python

Build robust AI and data applications over S3 with Serverless functions and Iceberg tables. No Kubernetes, no Spark, no infrastructure to manage.



import bauplan

client = bauplan.Client()

# Create a new branch from main
branch = client.create_branch(
    branch="dev_branch",
    from_ref="main"
)
print(f'Created branch "{branch.name}"')

# List the tables visible in the new branch
for table in client.get_tables(ref=branch):
    print(f"{table.namespace:<12} {table.name:<12} {table.kind}")

Zero-copy data lake branches: create sandboxes instantly without data duplication.


Build an Iceberg Lakehouse in pure Python

Implement a Data Lakehouse architecture over object storage with a Write-Audit-Publish pattern for safe data ingestion.

Built with Bauplan

From unstructured to structured data with LLMs

Transform PDFs into structured tables using a Bauplan LLM pipeline, ensuring versioning, safe experimentation, and seamless analysis.
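As a sketch of the shape of that pipeline (the model, input table, prompt, and fields below are illustrative, not the tutorial's actual code, and inputs are assumed to arrive as Arrow tables), a bauplan model can wrap the LLM call and return a structured table:

import bauplan

@bauplan.model()
@bauplan.python(pip={'openai': '1.30.0', 'pandas': '2.2.0'})
def extract_invoice_fields(
    # hypothetical input: one row of extracted text per PDF
    pdfs=bauplan.Model('raw_pdf_text')
):
    import json
    import pandas as pd
    from openai import OpenAI

    llm = OpenAI()  # reads OPENAI_API_KEY from the environment
    rows = []
    for doc in pdfs.to_pylist():
        # ask the LLM to return the fields we care about as JSON
        response = llm.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{
                'role': 'user',
                'content': 'Extract vendor, date, and total as JSON: ' + doc['text'],
            }],
            response_format={'type': 'json_object'},
        )
        rows.append(json.loads(response.choices[0].message.content))
    # the returned DataFrame is persisted as a versioned table in the branch
    return pd.DataFrame(rows)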

End-to-End RecSys with MongoDB

Build a full-stack music recommender system using Bauplan for data processing and MongoDB for serving. Train embeddings on Spotify playlists, store them in Iceberg and MongoDB, and explore recommendations via a Streamlit app.
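The heart of that project fits in one model. A minimal sketch, assuming a made-up playlist table layout and a placeholder MongoDB URI:

import bauplan

@bauplan.model()
@bauplan.python(pip={'gensim': '4.3.2', 'pymongo': '4.6.0', 'pandas': '2.2.0'})
def track_embeddings(
    # hypothetical input table of (playlist_id, track_id) rows
    playlists=bauplan.Model('spotify_playlists')
):
    import pandas as pd
    from gensim.models import Word2Vec
    from pymongo import MongoClient

    df = playlists.to_pandas()
    df['track_id'] = df['track_id'].astype(str)  # gensim expects string tokens
    # treat each playlist as a "sentence" of track ids
    sentences = df.groupby('playlist_id')['track_id'].apply(list).tolist()
    w2v = Word2Vec(sentences, vector_size=64, window=10, min_count=3)

    out = pd.DataFrame({
        'track_id': w2v.wv.index_to_key,
        'vector': [w2v.wv[t].tolist() for t in w2v.wv.index_to_key],
    })
    # push the vectors to MongoDB for online serving
    coll = MongoClient('mongodb://localhost:27017')['recsys']['embeddings']
    coll.insert_many(out.to_dict('records'))
    return out  # also lands in the lake as an Iceberg table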

Build a Lakehouse with Apache Iceberg in pure Python

Data Lakehouse architecture over object storage with Iceberg tables and robust Write-Audit-Publish workflows for safe data ingestion. Build a Lakehouse in ~150 lines of Python without needing a Data Warehouse, JVM, or Iceberg expertise.
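Reusing only the client calls shown elsewhere on this page, the Write-Audit-Publish loop looks roughly like this (branch, table, and source names are placeholders; a failed audit simply means you never merge):

import bauplan

client = bauplan.Client()
branch = 'ingest_branch'                     # placeholder names
source = 's3://my-bucket/orders/*.parquet'

# Write: land the new data in an isolated branch
client.create_branch(branch, from_ref='main')
client.create_table('orders', source, branch)
client.import_data('orders', source, branch)

# Audit: run the pipeline, including its expectations, against the branch
client.run('./my_project_dir', branch)

# Publish: only audited tables ever reach the main data lake
client.merge_branch(branch, into_branch='main')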


Data quality and expectations

Define data quality constraints with expectations to enforce standards and monitor pipeline updates using blazing fast vectorized tests.
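In a bauplan project an expectation is just another decorated function that runs with the pipeline. A minimal sketch; the decorator and helper import follow the pattern in bauplan's docs, and the table and column names are hypothetical:

import bauplan
from bauplan.standard_expectations import expect_column_no_nulls

@bauplan.expectation()
@bauplan.python('3.11')
def order_ids_are_complete(
    # hypothetical table under test
    data=bauplan.Model('orders')
):
    # vectorized check: fail the run if any order_id is null
    return expect_column_no_nulls(data, 'order_id')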

Forecasting ML Pipeline

A streamlined pipeline transforming raw data into a predictions table: prepare a training set and train a Linear Regression model.
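That whole step can be a single model. A sketch, with feature and target columns invented for illustration:

import bauplan

@bauplan.model()
@bauplan.python(pip={'scikit-learn': '1.4.0', 'pandas': '2.2.0'})
def predictions(
    # hypothetical upstream model producing the training set
    training_set=bauplan.Model('training_set')
):
    from sklearn.linear_model import LinearRegression

    df = training_set.to_pandas()
    features, target = df[['feature_1', 'feature_2']], df['target']
    model = LinearRegression().fit(features, target)

    # the predictions table is versioned like any other artifact
    df['prediction'] = model.predict(features)
    return df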

Data Augmentation with OpenAI

Entity matching across different e-commerce product catalogs, leveraging an off-the-shelf LLM API from OpenAI. The entire project runs on object storage (S3) in open formats (Iceberg), relying solely on vanilla Python to orchestrate the DAG and integrate AI services.
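The matching step itself stays vanilla Python. A hedged sketch, with the prompt and catalog fields invented for illustration:

from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def same_product(title_a: str, title_b: str) -> bool:
    # ask the LLM whether two catalog entries describe the same product
    response = llm.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{
            'role': 'user',
            'content': f'Do these two listings describe the same product? '
                       f'Answer YES or NO.\nA: {title_a}\nB: {title_b}',
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith('YES')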


The optimal workflow for data teams

Branch

Instant Zero-Copy Environments

Quickly spin up development and testing environments without duplicating data.

Version Control for Data

Work with data the way you work with code. Use familiar operations like branching, checkout, and merging.

Safe and Sandboxed Experiments

Keep your production environment safe. Collaborate in fully isolated, sandboxed environments.


import bauplan

client = bauplan.Client()

# Create a new branch from main
branch = client.create_branch(
    branch="dev_branch",
    from_ref="main"
)
print(f'Created branch "{branch.name}"')

# List the tables visible in the new branch
for table in client.get_tables(ref=branch):
    print(f"{table.namespace:<12} {table.name:<30} {table.kind}")


Develop

No Infrastructure to Manage

Define environments entirely in code — never worry about containers and environment management.

Pure Python

Build and test data applications directly in your IDE — no need to learn new frameworks, just code as you normally would.

Serverless Functions

Execute your workloads seamlessly in the cloud, combining serverless functions into pipelines.

import bauplan

@bauplan.model()
# Specify the Python environment with exact package versions
@bauplan.python(pip={'pandas': '2.2.0'})
def clean_data(
    # Input model reference - points to an existing table
    data=bauplan.Model('my_data')
):
    import pandas as pd

    # Your data transformation logic here
    ...

    # Return the transformed data; it becomes the 'clean_data' table
    return data

Automate

Merge your changes

Deploy by merging new tables into your main data lake. Use our Python SDK to automate your CI/CD pipelines and deployments.

Built-In Testing

Incorporate unit tests and expectations directly into your workflows, ensuring your data is always reliable and consistent.

Effortless Integration

Connect to visualization tools and orchestrators with just one line of code.

import bauplan

client = bauplan.Client()

# example names and source - adjust to your setup
dev_branch = 'dev_branch'
table_name = 'my_table'
data_source = 's3://my-bucket/my-data/*.parquet'

# create a zero-copy branch of your data lake
client.create_branch(dev_branch, from_ref='main')
# create an Iceberg table and import data into it
client.create_table(table_name, data_source, dev_branch)
client.import_data(table_name, data_source, dev_branch)
# run a pipeline end-to-end in the branch
client.run('./my_project_dir', dev_branch)
# merge the new tables into the main data lake
client.merge_branch(dev_branch, into_branch='main')

print('So Long, and Thanks for All the Fish')
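As for the one-line integration, it is typically a query straight into a dataframe. A sketch, assuming the client's query helper and a placeholder table:

import bauplan

client = bauplan.Client()

# one line from the data lake into any pandas-friendly tool
df = client.query('SELECT * FROM my_table LIMIT 100', ref='main').to_pandas()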


Try bauplan for free

Create your sandbox and start building.
