marimo-learn / duckdb /01_getting_started.py
Azmi-84
Enhance and redesign DuckDB introductory notebook
e5fc993
raw
history blame
43.7 kB
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "marimo",
# "duckdb==1.2.2",
# "polars==1.27.0",
# "numpy==2.2.4",
# "pyarrow==19.0.1",
# "pandas==2.2.3",
# "sqlglot==26.12.1",
# "plotly==5.23.1",
# ]
# ///
import marimo
__generated_with = "0.13.4"
app = marimo.App(width="medium")
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
<p align="center">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
</p>
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
# 🦆 **DuckDB**: An Embeddable Analytical Database System
## What is DuckDB?
[DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server.
---
## ⚡ Key Features
| Feature | Description |
|:---------|:-------------|
| **In-Process Architecture** | Runs directly within your application's memory space - no separate server needed, simplifying deployment |
| **Columnar Storage** | Data stored in columns instead of rows, dramatically improving performance for analytical queries |
| **Vectorized Execution** | Performs operations on entire columns at once, significantly speeding up data processing |
| **ACID Transactions** | Ensures data integrity and reliability across operations |
| **Multi-Language Support** | Provides APIs for `Python`, `R`, `Java`, `C++`, and more |
| **Zero External Dependencies** | Minimal dependencies, making setup and deployment straightforward |
| **High Portability** | Works across various operating systems (Windows, macOS, Linux) and hardware architectures |
---
## [Use Cases](https://github.com/davidgasquez/awesome-duckdb?tab=readme-ov-file):
- **Data Analysis and Exploration:** DuckDB is ideal for quickly querying and analyzing datasets, especially for initial exploratory analysis.
- **Embedded Analytics in Applications:** You can integrate DuckDB directly into your applications to provide analytical capabilities without the need for a separate database server.
- **ETL (Extract, Transform, Load) Processes:** DuckDB can be used to perform initial data transformation and cleaning steps as part of an ETL pipeline.
- **Data Science and Machine Learning Workflows:** It's a lightweight alternative to larger databases for prototyping data analysis and machine learning models.
- **Rapid Prototyping of Data Analysis Pipelines:** Quickly test and iterate on data analysis ideas without the complexity of setting up a full-blown database environment.
- **Small to Medium Datasets:** DuckDB shines when working with datasets that don't require the massive scalability of a traditional database server.
---
### [Installation](https://duckdb.org/docs/installation/?version=stable&environment=python):
- Python installation:
```
pip install duckdb
```
```
conda install python-duckdb -c conda-forge.
```
<!-- >**_Note_:** DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system. -->
/// attention | Note
DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
///
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)
DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.
---
| Feature | In-Memory Connection | File-Based Connection |
|:---------|:---------------------|:----------------------|
| Persistence | Temporary (lost when session ends) | Stored on disk (persists between sessions) |
| Use Cases | Quick analysis, ephemeral data, testing | Long-term storage, data that needs to be accessed later |
| Performance | Faster for most operations | Slightly slower but provides persistence |
| Creation | duckdb.connect(':memory:') | duckdb.connect('filename.db') |
| Multiple Connection Access | Limited to single connection | Multiple connections can access the same database |
"""
)
return
@app.cell
def _(os):
# Remove previous database if it exists
if os.path.exists("example.db"):
os.remove("example.db")
if not os.path.exists("data"):
os.makedirs("data")
return
@app.cell
def _(mo):
_df = mo.sql(
f"""
-- Print the DuckDB version
SELECT version() AS version_info
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
"""
## Creating DuckDB Connections
Let's create both types of DuckDB connections and explore their characteristics.
1. **In-memory connection**: Data exists only during the current session
2. **File-based connection**: Data persists between sessions
We'll then demonstrate the key differences between these connection types.
"""
)
return
@app.cell
def _(duckdb):
# Create an in-memory DuckDB connection
memory_db = duckdb.connect(":memory:")
# Create a file-based DuckDB connection
file_db = duckdb.connect("example.db")
return file_db, memory_db
@app.cell
def _(file_db, memory_db):
# Test both connections
memory_db.execute(
"CREATE TABLE IF NOT EXISTS mem_test (id INTEGER, name VARCHAR)"
)
memory_db.execute("INSERT INTO mem_test VALUES (1, 'Memory Test')")
file_db.execute(
"CREATE TABLE IF NOT EXISTS file_test (id INTEGER, name VARCHAR)"
)
file_db.execute("INSERT INTO file_test VALUES (1, 'File Test')")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
## Testing Connection Persistence
Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.
1. First, we'll query our tables to confirm the data was properly inserted
2. Then, we'll simulate an application restart by creating new connections
3. Finally, we'll check which data persists after the "restart"
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""## Current Database Contents""")
return
@app.cell
def _(mem_test, memory_db, mo):
_df = mo.sql(
f"""
SELECT * FROM mem_test
""",
engine=memory_db
)
return
@app.cell
def _(file_db, file_test, mo):
_df = mo.sql(
f"""
SELECT * FROM file_test
""",
engine=file_db
)
return
@app.cell
def _():
# We don't actually close the connections here since we need them for later cells
# Just a placeholder for the concept
return
@app.cell(hide_code=True)
def _file_query(mo):
mo.md(rf"""## 🔄 Simulating Application Restart...""")
return
@app.cell
def _(duckdb):
# Create new connections (simulating restart)
new_memory_db = duckdb.connect(":memory:")
new_file_db = duckdb.connect("example.db")
return new_file_db, new_memory_db
@app.cell
def _(new_memory_db):
# Try to query tables in the new memory connection
try:
new_memory_db.execute("SELECT * FROM mem_test").df()
memory_persistence = "✅ Data persisted in memory (unexpected)"
memory_data_available = True
except Exception as e:
memory_persistence = "❌ Data lost from memory (expected behavior)"
memory_data_available = False
return memory_data_available, memory_persistence
@app.cell
def _(new_file_db):
# Try to query tables in the new file connection
try:
file_data = new_file_db.execute("SELECT * FROM file_test").df()
file_persistence = "✅ Data persisted in file (expected behavior)"
file_data_available = True
except Exception as e:
file_persistence = "❌ Data lost from file (unexpected)"
file_data_available = False
file_data = None
return file_data, file_data_available, file_persistence
@app.cell
def _(
file_data_available,
file_persistence,
memory_data_available,
memory_persistence,
mo,
):
# Create an interactive display to show persistence results
persistence_results = mo.ui.table(
{
"Connection Type": ["In-Memory Database", "File-Based Database"],
"Persistence Status": [memory_persistence, file_persistence],
"Data Available After Restart": [
memory_data_available,
file_data_available,
],
}
)
mo.md("### Persistence Test Results")
return (persistence_results,)
@app.cell
def _(persistence_results):
persistence_results
return
@app.cell
def _(file_data, file_data_available, mo):
if file_data_available:
mo.md("### Persisted File-Based Data:")
mo.ui.table(file_data)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)
DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.
## Table Creation Options
DuckDB supports various table creation options, including:
- **Basic tables** with column definitions
- **Temporary tables** that exist only during the session
- **CREATE OR REPLACE** to recreate tables
- **Primary keys** and other constraints
- **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
"""
)
return
@app.cell
def _create_users_tables(file_db, new_memory_db):
# For the memory database
try:
new_memory_db.execute("DROP TABLE IF EXISTS users_memory")
except:
pass
# For the file database
try:
file_db.execute("DROP TABLE IF EXISTS users_file")
except:
pass
return
@app.cell
def _(file_db, new_memory_db):
# Create advanced users table in memory database with primary key
new_memory_db.execute("""
CREATE TABLE users_memory (
id INTEGER PRIMARY KEY,
name VARCHAR NOT NULL,
age INTEGER CHECK (age > 0),
email VARCHAR UNIQUE,
registration_date DATE DEFAULT CURRENT_DATE,
last_login TIMESTAMP,
account_balance DECIMAL(10,2) DEFAULT 0.00
)
""")
# Create users table in file database
file_db.execute("""
CREATE TABLE users_file (
id INTEGER PRIMARY KEY,
name VARCHAR NOT NULL,
age INTEGER CHECK (age > 0),
email VARCHAR UNIQUE,
registration_date DATE DEFAULT CURRENT_DATE,
last_login TIMESTAMP,
account_balance DECIMAL(10,2) DEFAULT 0.00
)
""")
return
@app.cell
def _(mo, new_memory_db):
# Get table schema information using DuckDB's internal system tables
memory_schema = new_memory_db.execute("""
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'users_memory'
ORDER BY ordinal_position
""").df()
# Display the schema using marimo's UI components
mo.md("### 🔍 Table Schema Information")
return (memory_schema,)
@app.cell(hide_code=True)
def _(memory_schema, mo):
mo.ui.table(memory_schema)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)
DuckDB supports multiple ways to insert data:
1. **INSERT INTO VALUES**: Insert specific values
2. **INSERT INTO SELECT**: Insert data from query results
3. **Parameterized inserts**: Using prepared statements
4. **Bulk inserts**: For efficient loading of multiple rows
Let's demonstrate these different insertion methods:
"""
)
return
@app.cell
def _insert_user_data(date):
today = date.today()
# First check if records already exist to avoid duplicate key errors
def safe_insert(connection, table_name, data):
"""
Safely insert data into a table by checking for existing IDs first
"""
# Check which IDs already exist in the table
existing_ids = (
connection.execute(f"SELECT id FROM {table_name}")
.fetchdf()["id"]
.tolist()
)
# Filter out data with IDs that already exist
new_data = [record for record in data if record[0] not in existing_ids]
if not new_data:
print(
f"No new records to insert into {table_name}. All IDs already exist."
)
return 0
# Prepare the placeholders for the SQL statement
placeholders = ", ".join(
["(" + ", ".join(["?"] * len(new_data[0])) + ")"] * len(new_data)
)
# Flatten the list of tuples for parameter binding
flat_data = [item for sublist in new_data for item in sublist]
# Perform the insertion
if flat_data:
columns = "(id, name, age, email, registration_date, last_login, account_balance)"
connection.execute(
f"INSERT INTO {table_name} {columns} VALUES {placeholders}",
flat_data,
)
return len(new_data)
return 0
return (safe_insert,)
@app.cell
def _():
# Prepare the data
user_data = [
(
1,
"Alice",
25,
"alice@example.com",
"2021-01-01",
"2023-01-15 14:30:00",
1250.75,
),
(
2,
"Bob",
30,
"bob@example.com",
"2021-02-01",
"2023-02-10 09:15:22",
750.50,
),
(
3,
"Charlie",
35,
"charlie@example.com",
"2021-03-01",
"2023-03-05 17:45:10",
3200.25,
),
(
4,
"David",
40,
"david@example.com",
"2021-04-01",
"2023-04-20 10:30:45",
1800.00,
),
(
5,
"Emma",
45,
"emma@example.com",
"2021-05-01",
"2023-05-12 11:20:30",
2500.00,
),
(
6,
"Frank",
50,
"frank@example.com",
"2021-06-01",
"2023-06-18 16:10:15",
900.25,
),
]
return (user_data,)
@app.cell
def _(mo, new_memory_db, safe_insert, user_data):
# Safely insert data into memory database
records_inserted = safe_insert(new_memory_db, "users_memory", user_data)
mo.md(
f"""
Inserted {records_inserted} new records into users_memory.
"""
)
return
@app.cell
def _(file_db, safe_insert, user_data):
def _():
# Safely insert data into file database
records_inserted = safe_insert(file_db, "users_file", user_data)
return print(f"Inserted {records_inserted} new records into users_file")
_()
return
@app.cell
def _():
# If you need to add just one record, you can use a similar approach:
new_user = (
7,
"Grace",
28,
"grace@example.com",
"2021-07-01",
"2023-07-22 13:45:10",
1675.50,
)
return (new_user,)
@app.cell
def _(new_memory_db, new_user):
# Check if the ID exists before inserting
if not new_memory_db.execute(
"SELECT id FROM users_memory WHERE id = ?", [new_user[0]]
).fetchone():
new_memory_db.execute(
"""
INSERT INTO users_memory (id, name, age, email, registration_date, last_login, account_balance)
VALUES (?, ?, ?, ?, ?, ?, ?)
""",
new_user,
)
print(f"Added user {new_user[1]} to users_memory")
else:
print(f"User with ID {new_user[0]} already exists in users_memory")
return
@app.cell
def _(file_db, new_user):
# Do the same for the file database
if not file_db.execute(
"SELECT id FROM users_file WHERE id = ?", [new_user[0]]
).fetchone():
file_db.execute(
"""
INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
VALUES (?, ?, ?, ?, ?, ?, ?)
""",
new_user,
)
print(f"Added user {new_user[1]} to users_file")
else:
print(f"User with ID {new_user[0]} already exists in users_file")
return
@app.cell
def _(new_memory_db):
# First try to update
cursor = new_memory_db.execute(
"""
UPDATE users_memory
SET name = ?, age = ?, email = ?,
registration_date = ?, last_login = ?, account_balance = ?
WHERE id = ?
""",
(
"Henry",
33,
"henry@example.com",
"2021-08-01",
"2023-08-05 09:10:15",
3100.75,
8, # ID should be the last parameter
),
)
return (cursor,)
@app.cell
def _(cursor, mo, new_memory_db):
# If no rows were updated, perform an insert
if cursor.rowcount == 0:
new_memory_db.execute(
"""
INSERT INTO users_memory
(id, name, age, email, registration_date, last_login, account_balance)
VALUES (?, ?, ?, ?, ?, ?, ?)
""",
(
8,
"Henry",
33,
"henry@example.com",
"2021-08-01",
"2023-08-05 09:10:15",
3100.75,
),
)
mo.md(
f"""
Upserted Henry into users_memory.
"""
)
return
@app.cell
def _(file_db, mo):
# For DuckDB using ON CONFLICT, we need to specify the conflict target column
file_db.execute(
"""
INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
name = EXCLUDED.name,
age = EXCLUDED.age,
email = EXCLUDED.email,
registration_date = EXCLUDED.registration_date,
last_login = EXCLUDED.last_login,
account_balance = EXCLUDED.account_balance
""",
(
8,
"Henry",
33,
"henry@example.com",
"2021-08-01",
"2023-08-05 09:10:15",
3100.75,
),
)
mo.md(
f"""
Upserted Henry into users_file.
"""
)
return
@app.cell
def _view_tables_after_insert(new_memory_db):
# Display memory data using DuckDB's query capabilities
memory_results = new_memory_db.execute("""
SELECT
id,
name,
age,
email,
registration_date,
last_login,
account_balance
FROM users_memory
ORDER BY id
""").df()
return (memory_results,)
@app.cell
def _(file_db):
# Display file data with formatting
file_results = file_db.execute("""
SELECT
id,
name,
age,
email,
registration_date,
last_login,
CAST(account_balance AS DECIMAL(10,2)) AS account_balance
FROM users_file
ORDER BY id
""").df()
return (file_results,)
@app.cell
def _(mo):
mo.md(
r"""
<!-- Create an interactive display with tabs using marimo components -->
## 📊 Database Contents After Insertion
"""
)
return
@app.cell(hide_code=True)
def _(file_results, memory_results, mo):
tabs = mo.ui.tabs(
{
"In-Memory Database": mo.ui.table(memory_results),
"File-Based Database": mo.ui.table(file_results),
}
)
tabs
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [4. Using SQL Directly in Marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)
There are multiple ways to leverage DuckDB's SQL capabilities in marimo:
1. **Direct execution**: Using DuckDB connections to execute SQL
2. **Marimo SQL**: Using Marimo's built-in SQL engine
3. **Interactive queries**: Combining UI elements with SQL execution
Let's explore these approaches:
"""
)
return
@app.cell(hide_code=True)
def _sql_with_marimo(mo):
mo.md(
rf"""
<!-- Using Marimo's SQL engine with direct SQL on memory_results DataFrame -->
## 🔍 Query with Marimo SQL
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
## Marimo has its own built-in SQL engine that can work with DataFrames.
Let's use it to filter our users:
"""
)
return
@app.cell
def _(mo):
# Create a SQL selector for users with age threshold
age_threshold = mo.ui.slider(25, 50, value=30, label="Minimum Age")
return (age_threshold,)
@app.cell
def _(age_threshold, memory_results, mo):
# Create a function to filter users based on the slider value
def filtered_users():
# Use DuckDB directly instead of mo.sql with users param
filtered_df = memory_results[memory_results["age"] >= age_threshold.value]
filtered_df = filtered_df.sort_values("age")
return mo.ui.table(filtered_df)
return (filtered_users,)
@app.cell
def _(age_threshold, filtered_users, mo):
layout = mo.vstack(
[
mo.md("### Select minimum age:"),
age_threshold,
mo.md("### Users meeting age criteria:"),
filtered_users(),
],
gap=1.5,
)
layout
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""")
return
@app.cell
def _polars_integration(pl):
# Create a Polars DataFrame
polars_df = pl.DataFrame(
{
"id": [101, 102, 103],
"name": ["Product A", "Product B", "Product C"],
"price": [29.99, 49.99, 19.99],
"category": ["Electronics", "Furniture", "Books"],
}
)
return (polars_df,)
@app.cell
def _(mo):
mo.md(
rf"""
<!-- Display the Polars DataFrame -->
## Original Polars DataFrame:
"""
)
return
@app.cell
def _(mo, polars_df):
mo.ui.table(polars_df)
return
@app.cell
def _(new_memory_db, polars_df):
# Register the Polars DataFrame as a DuckDB table in memory connection
new_memory_db.register("products_polars", polars_df)
# Query the registered table
polars_query_result = new_memory_db.execute(
"SELECT * FROM products_polars WHERE price > 25"
).df()
return (polars_query_result,)
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
<!-- Display the query result -->
## DuckDB Query Result (From Polars Data):
"""
)
return
@app.cell
def _(mo, polars_query_result):
mo.ui.table(polars_query_result)
return
@app.cell
def _(mo, new_memory_db):
# Demonstrate a more complex query
complex_query_result = new_memory_db.execute("""
SELECT
category,
COUNT(*) as product_count,
AVG(price) as avg_price,
MIN(price) as min_price,
MAX(price) as max_price
FROM products_polars
GROUP BY category
ORDER BY avg_price DESC
""").df()
mo.md("## Aggregated Product Data by Category:")
return (complex_query_result,)
@app.cell
def _(complex_query_result, mo):
mo.ui.table(complex_query_result)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""")
return
@app.cell
def _join_operations(new_memory_db):
# Create another table to join with
new_memory_db.execute("""
CREATE TABLE IF NOT EXISTS departments (
id INTEGER,
department_name VARCHAR,
manager_id INTEGER
)
""")
return
@app.cell
def _(new_memory_db):
new_memory_db.execute("""
INSERT INTO departments VALUES
(101, 'Engineering', 1),
(102, 'Marketing', 2),
(103, 'Finance', NULL)
""")
return
@app.cell
def _(new_memory_db):
# Execute a join query
join_result = new_memory_db.execute("""
SELECT
u.id,
u.name,
u.age,
d.department_name
FROM users_memory u
LEFT JOIN departments d ON u.id = d.manager_id
ORDER BY u.id
""").df()
return (join_result,)
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
<!-- Display the join result -->
## Join Result (Users and Departments):
"""
)
return
@app.cell
def _(join_result, mo):
mo.ui.table(join_result)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
<!-- Demonstrate different types of joins -->
## Different Types of Joins
"""
)
return
@app.cell
def _(new_memory_db):
# Inner join
inner_join = new_memory_db.execute("""
SELECT u.id, u.name, d.department_name
FROM users_memory u
INNER JOIN departments d ON u.id = d.manager_id
""").df()
# Right join
right_join = new_memory_db.execute("""
SELECT u.id, u.name, d.department_name
FROM users_memory u
RIGHT JOIN departments d ON u.id = d.manager_id
""").df()
# Full outer join
full_join = new_memory_db.execute("""
SELECT u.id, u.name, d.department_name
FROM users_memory u
FULL OUTER JOIN departments d ON u.id = d.manager_id
""").df()
return full_join, inner_join, right_join
@app.cell
def _(full_join, inner_join, join_result, mo, right_join):
join_tabs = mo.ui.tabs(
{
"Left Join": mo.ui.table(join_result),
"Inner Join": mo.ui.table(inner_join),
"Right Join": mo.ui.table(right_join),
"Full Outer Join": mo.ui.table(full_join),
}
)
join_tabs
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""")
return
@app.cell
def _aggregate_operations(new_memory_db):
# Execute an aggregate query
agg_result = new_memory_db.execute("""
SELECT
AVG(age) as avg_age,
MAX(age) as max_age,
MIN(age) as min_age,
COUNT(*) as total_users,
SUM(account_balance) as total_balance
FROM users_memory
""").df()
return (agg_result,)
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
<!-- Display the aggregate result -->
## Aggregate Results (All Users):
"""
)
return
@app.cell
def _(agg_result, mo):
mo.ui.table(agg_result)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
<!-- More complex aggregate query with grouping -->
## Aggregate Results (Grouped by Age Range):
"""
)
return
@app.cell
def _(new_memory_db):
age_groups = new_memory_db.execute("""
SELECT
CASE
WHEN age < 30 THEN 'Under 30'
WHEN age BETWEEN 30 AND 40 THEN '30 to 40'
ELSE 'Over 40'
END as age_group,
COUNT(*) as count,
AVG(age) as avg_age,
AVG(account_balance) as avg_balance
FROM users_memory
GROUP BY 1
ORDER BY 1
""").df()
return (age_groups,)
@app.cell
def _(age_groups, mo):
mo.ui.table(age_groups)
return
@app.cell
def _(mo):
mo.md(
r"""
<!-- Window functions demo -->
### Window Functions Example:
"""
)
return
@app.cell
def _(mo, new_memory_db):
window_result = new_memory_db.execute("""
SELECT
id,
name,
age,
account_balance,
RANK() OVER (ORDER BY account_balance DESC) as balance_rank,
account_balance - AVG(account_balance) OVER () as diff_from_avg,
account_balance / SUM(account_balance) OVER () * 100 as pct_of_total
FROM users_memory
ORDER BY balance_rank
""").df()
mo.ui.table(window_result)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""")
return
@app.cell
def _convert_results(new_memory_db):
polars_result = new_memory_db.execute(
"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
).pl()
return (polars_result,)
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
<!-- Display the converted results -->
## Query Result as Polars DataFrame:
"""
)
return
@app.cell
def _(mo, polars_result):
mo.ui.table(polars_result)
return
@app.cell
def _(new_memory_db):
pandas_result = new_memory_db.execute(
"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
).fetch_df()
return (pandas_result,)
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""## Same Query Result as Pandas DataFrame:""")
return
@app.cell
def _(mo, pandas_result):
mo.ui.table(pandas_result)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
<!-- Demonstrate the differences in handling -->
## Differences in DataFrame Handling
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
<!-- Polars operation -->
## Polars: Filter users over 35 and calculate average balance
"""
)
return
@app.cell
def _(mo, pl, polars_result):
def _():
polars_filtered = polars_result.filter(pl.col("age") > 35)
polars_avg = polars_filtered.select(
pl.col("account_balance").mean().alias("avg_balance")
)
layout = mo.vstack(
[
mo.md("### Filtered Polars DataFrame (Age > 35):"),
mo.ui.table(polars_filtered),
mo.md("### Average Account Balance:"),
mo.ui.table(polars_avg),
],
gap=1.5,
)
return layout
_()
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
<!-- Pandas equivalent (using pandas style) -->
## Pandas: Same operation in pandas style
"""
)
return
@app.cell
def _(mo, pandas_result):
pandas_avg = pandas_result[pandas_result["age"] > 35]["account_balance"].mean()
mo.md(f"Average balance: {pandas_avg:.2f}")
return
@app.cell(hide_code=True)
def _(mo):
mo.md("""## 9. Data Visualization with DuckDB and Plotly""")
return
@app.cell
def _(age_groups, mo, new_memory_db, plotly_express):
# User distribution by age group
fig1 = plotly_express.bar(
age_groups,
x="age_group",
y="count",
title="User Distribution by Age Group",
labels={"count": "Number of Users", "age_group": "Age Group"},
color="age_group",
color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
)
fig1.update_traces(
text=age_groups["count"],
textposition="outside",
)
fig1.update_layout(height=450, margin=dict(t=50, b=50))
# Average balance by age group
fig2 = plotly_express.bar(
age_groups,
x="age_group",
y="avg_balance",
title="Average Account Balance by Age Group",
labels={"avg_balance": "Average Balance ($)", "age_group": "Age Group"},
color="age_group",
color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
)
fig2.update_traces(
text=[f"${val:.2f}" for val in age_groups["avg_balance"]],
textposition="outside",
)
fig2.update_layout(height=450, margin=dict(t=50, b=50))
# Age vs Account Balance scatter plot
scatter_data = new_memory_db.execute(
"""
SELECT
name,
age,
account_balance
FROM users_memory
ORDER BY age
"""
).df()
fig3 = plotly_express.scatter(
scatter_data,
x="age",
y="account_balance",
title="Age vs. Account Balance",
labels={"account_balance": "Account Balance ($)", "age": "Age"},
color_discrete_sequence=["#FF7F0E"],
trendline="ols",
hover_data=["age", "account_balance"],
size_max=15,
)
fig3.update_traces(marker=dict(size=12))
fig3.update_layout(height=450, margin=dict(t=50, b=50))
# Distribution of account balances
balance_data = new_memory_db.execute(
"""
SELECT
name,
account_balance
FROM users_memory
ORDER BY account_balance DESC
"""
).df()
fig4 = plotly_express.pie(
balance_data,
names="name",
values="account_balance",
title="Distribution of Account Balances",
labels={"account_balance": "Account Balance ($)", "name": "User"},
color_discrete_sequence=plotly_express.colors.qualitative.Pastel,
)
fig4.update_traces(textinfo="percent+label", textposition="inside")
fig4.update_layout(height=450, margin=dict(t=50, b=50))
category_tabs = mo.ui.tabs(
{
"Age Group Analysis": mo.vstack(
[
mo.ui.tabs(
{
"User Distribution": mo.ui.plotly(fig1),
"Average Balance": mo.ui.plotly(fig2),
}
)
]
),
"Financial Analysis": mo.vstack(
[
mo.ui.tabs(
{
"Age vs Balance": mo.ui.plotly(fig3),
"Balance Distribution": mo.ui.plotly(fig4),
}
)
]
),
},
lazy=True,
)
mo.vstack(
[
mo.md("### Select a visualization category:"),
category_tabs,
],
gap=1.5,
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [9. Database Management Best Practices]
### Closing Connections
It's important to close database connections when you're done with them, especially for file-based connections:
```python
memory_db.close()
file_db.close()
```
### Transaction Management
DuckDB supports transactions, which can be useful for more complex operations:
```python
conn = duckdb.connect('mydb.db')
conn.begin() # Start transaction
try:
conn.execute("INSERT INTO users VALUES (1, 'Test User')")
conn.execute("UPDATE balances SET amount = amount - 100 WHERE user_id = 1")
conn.commit() # Commit changes
except:
conn.rollback() # Undo changes if error
raise
```
### Query Performance
DuckDB is optimized for analytical queries. For best performance:
- Use appropriate data types
- Create indexes for frequently queried columns
- For large datasets, consider partitioning
- Use prepared statements for repeated queries
"""
)
return
@app.cell(hide_code=True)
def _interactive_dashboard(mo):
mo.md(rf"""## 10. Interactive DuckDB Dashboard with Marimo and Plotly""")
return
@app.cell
def _(mo):
# Create an interactive filter for age range
min_age = mo.ui.slider(20, 50, value=25, label="Minimum Age")
max_age = mo.ui.slider(20, 50, value=50, label="Maximum Age")
return max_age, min_age
@app.cell
def _(max_age, min_age, new_memory_db):
# Create a function to filter data and update visualizations
def get_filtered_data(min_val=min_age.value, max_val=max_age.value):
# Get filtered data based on slider values using parameterized query for safety
return new_memory_db.execute(
"""
SELECT
id,
name,
age,
email,
account_balance,
registration_date
FROM users_memory
WHERE age >= ? AND age <= ?
ORDER BY age
""",
[min_val, max_val],
).df()
return (get_filtered_data,)
@app.cell
def _(get_filtered_data):
def get_metrics(data=get_filtered_data()):
return {
"user count": len(data),
"avg_balance": data["account_balance"].mean() if len(data) > 0 else 0,
"total_balance": data["account_balance"].sum() if len(data) > 0 else 0,
}
return (get_metrics,)
@app.cell
def _(get_metrics, mo):
def metrics_display(metrics=get_metrics()):
return mo.hstack(
[
mo.vstack(
[
mo.md("### Selected Users"),
mo.md(f"## {metrics['user count']}"),
],
align="center",
),
mo.vstack(
[
mo.md("### Average Balance"),
mo.md(f"## ${metrics['avg_balance']:.2f}"),
],
align="center",
),
mo.vstack(
[
mo.md("### Total Balance"),
mo.md(f"## ${metrics['total_balance']:.2f}"),
],
align="center",
),
],
justify="space-between",
gap=1.5,
)
return (metrics_display,)
@app.cell
def _(get_filtered_data, max_age, min_age, mo, plotly_express):
def create_visualization(
data=get_filtered_data(), min_val=min_age.value, max_val=max_age.value
):
if len(data) == 0:
return mo.ui.text("No data available for the selected age range.")
# Create visualizations for filtered data
fig1 = plotly_express.bar(
data,
x="name",
y="account_balance",
title=f"Account Balance by User (Age {min_val} - {max_val})",
labels={"account_balance": "Account Balance ($)", "name": "User"},
color="account_balance",
color_continuous_scale=plotly_express.colors.sequential.Plasma,
text_auto=".2s",
)
fig1.update_layout(
height=400,
xaxis_tickangle=-45,
margin=dict(t=50, b=70, l=50, r=30),
)
fig1.update_traces(
textposition="outside",
)
fig2 = plotly_express.histogram(
data,
x="age",
nbins=min(10, len(set(data["age"]))),
title=f"Age Distribution (Age {min_val} - {max_val})",
color_discrete_sequence=["#4C78A8"],
opacity=0.8,
histnorm="probability density",
)
fig2.update_layout(
height=400,
margin=dict(t=50, b=70, l=50, r=30),
bargap=0.1,
)
fig3 = plotly_express.scatter(
data,
x="age",
y="account_balance",
title=f"Age vs. Account Balance (Age {min_val} - {max_val})",
labels={"account_balance": "Account Balance ($)", "age": "Age"},
color="age",
color_continuous_scale="Viridis",
size_max=25,
size="account_balance",
hover_name="name",
)
fig3.update_layout(
height=400,
margin=dict(t=50, b=70, l=50, r=30),
)
return mo.ui.tabs(
{
"Account Balance by User": mo.ui.plotly(fig1),
"Age Distribution": mo.ui.plotly(fig2),
"Age vs. Account Balance": mo.ui.plotly(fig3),
}
)
return (create_visualization,)
@app.cell
def _(
create_visualization,
get_filtered_data,
max_age,
metrics_display,
min_age,
mo,
):
def dashboard(
min_val=min_age.value,
max_val=max_age.value,
metrics=metrics_display(),
data=get_filtered_data(),
visualization=create_visualization()
):
return mo.vstack(
[
mo.md(f"### Interactive Dashboard (Age {min_val} - {max_val})"),
metrics,
mo.md("### Data Table"),
mo.ui.table(data, page_size=5),
mo.md("### Visualizations"),
visualization,
],
gap=2
)
dashboard()
return
@app.cell(hide_code=True)
def _conclusion(mo):
mo.md(
rf"""
# Summary and Key Takeaways
In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:
1. **Connection types**: We learned the difference between in-memory databases (temporary) and file-based databases (persistent).
2. **Table creation**: We created tables with various data types, constraints, and primary keys.
3. **Data insertion**: We demonstrated different ways to insert data, including single inserts and bulk loading.
4. **SQL queries**: We executed various SQL queries directly and through Marimo's UI components.
5. **Integration with Polars**: We showed how DuckDB can work seamlessly with Polars DataFrames.
6. **Joins and relationships**: We performed JOIN operations between tables to combine related data.
7. **Aggregation**: We used aggregate functions to summarize and analyze data.
8. **Data conversion**: We converted DuckDB results to both Polars and Pandas DataFrames.
9. **Best practices**: We reviewed best practices for managing DuckDB connections and transactions.
10. **Visualization**: We created interactive visualizations and dashboards with Plotly and Marimo.
DuckDB is an excellent tool for data analysis, especially for analytical workloads. Its in-process nature makes it fast and easy to use, while its SQL compatibility makes it accessible for anyone familiar with SQL databases.
### Next Steps
- Try loading larger datasets into DuckDB
- Experiment with more complex queries and window functions
- Use DuckDB's COPY functionality to import/export data from/to files
- Create more advanced interactive dashboards with Marimo and Plotly
"""
)
return
@app.cell(hide_code=True)
def _():
import marimo as mo
import duckdb
import polars as pl
import os
from datetime import date
import plotly.express as plotly_express
import plotly.graph_objects as plotly_graph_objects
import numpy as np
return date, duckdb, mo, os, pl, plotly_express
if __name__ == "__main__":
app.run()