Spaces:

marimo-team
/

marimo-learn

Running

marimo-learn / duckdb /01_getting_started.py

Azmi-84

Enhance and redesign DuckDB introductory notebook

e5fc993 20 days ago

43.7 kB

	# /// script
	# requires-python = ">=3.11"
	# dependencies = [
	# "marimo",
	# "duckdb==1.2.2",
	# "polars==1.27.0",
	# "numpy==2.2.4",
	# "pyarrow==19.0.1",
	# "pandas==2.2.3",
	# "sqlglot==26.12.1",
	# "plotly==5.23.1",
	# ]
	# ///

	import marimo

	__generated_with = "0.13.4"
	app = marimo.App(width="medium")


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	rf"""
	<p align="center">
	<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
	</p>
	"""
	)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	rf"""
	# 🦆 DuckDB: An Embeddable Analytical Database System

	## What is DuckDB?

	[DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server.

	---

	## ⚡ Key Features

	\| Feature \| Description \|
	\|:---------\|:-------------\|
	\| In-Process Architecture \| Runs directly within your application's memory space - no separate server needed, simplifying deployment \|
	\| Columnar Storage \| Data stored in columns instead of rows, dramatically improving performance for analytical queries \|
	\| Vectorized Execution \| Performs operations on entire columns at once, significantly speeding up data processing \|
	\| ACID Transactions \| Ensures data integrity and reliability across operations \|
	\| Multi-Language Support \| Provides APIs for `Python`, `R`, `Java`, `C++`, and more \|
	\| Zero External Dependencies \| Minimal dependencies, making setup and deployment straightforward \|
	\| High Portability \| Works across various operating systems (Windows, macOS, Linux) and hardware architectures \|

	---

	## [Use Cases](https://github.com/davidgasquez/awesome-duckdb?tab=readme-ov-file):

	- Data Analysis and Exploration: DuckDB is ideal for quickly querying and analyzing datasets, especially for initial exploratory analysis.
	- Embedded Analytics in Applications: You can integrate DuckDB directly into your applications to provide analytical capabilities without the need for a separate database server.
	- ETL (Extract, Transform, Load) Processes: DuckDB can be used to perform initial data transformation and cleaning steps as part of an ETL pipeline.
	- Data Science and Machine Learning Workflows: It's a lightweight alternative to larger databases for prototyping data analysis and machine learning models.
	- Rapid Prototyping of Data Analysis Pipelines: Quickly test and iterate on data analysis ideas without the complexity of setting up a full-blown database environment.
	- Small to Medium Datasets: DuckDB shines when working with datasets that don't require the massive scalability of a traditional database server.

	---

	### [Installation](https://duckdb.org/docs/installation/?version=stable&environment=python):

	- Python installation:
	```
	pip install duckdb
	```
	```
	conda install python-duckdb -c conda-forge.
	```

	<!-- >_Note_: DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system. -->

	/// attention \| Note
	DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
	///
	"""
	)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	# [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)

	DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.

	---

	\| Feature \| In-Memory Connection \| File-Based Connection \|
	\|:---------\|:---------------------\|:----------------------\|
	\| Persistence \| Temporary (lost when session ends) \| Stored on disk (persists between sessions) \|
	\| Use Cases \| Quick analysis, ephemeral data, testing \| Long-term storage, data that needs to be accessed later \|
	\| Performance \| Faster for most operations \| Slightly slower but provides persistence \|
	\| Creation \| duckdb.connect(':memory:') \| duckdb.connect('filename.db') \|
	\| Multiple Connection Access \| Limited to single connection \| Multiple connections can access the same database \|

	"""
	)
	return


	@app.cell
	def _(os):
	# Remove previous database if it exists
	if os.path.exists("example.db"):
	os.remove("example.db")

	if not os.path.exists("data"):
	os.makedirs("data")
	return


	@app.cell
	def _(mo):
	_df = mo.sql(
	f"""
	-- Print the DuckDB version
	SELECT version() AS version_info
	"""
	)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	"""
	## Creating DuckDB Connections

	Let's create both types of DuckDB connections and explore their characteristics.

	1. In-memory connection: Data exists only during the current session
	2. File-based connection: Data persists between sessions

	We'll then demonstrate the key differences between these connection types.
	"""
	)
	return


	@app.cell
	def _(duckdb):
	# Create an in-memory DuckDB connection
	memory_db = duckdb.connect(":memory:")

	# Create a file-based DuckDB connection
	file_db = duckdb.connect("example.db")
	return file_db, memory_db


	@app.cell
	def _(file_db, memory_db):
	# Test both connections
	memory_db.execute(
	"CREATE TABLE IF NOT EXISTS mem_test (id INTEGER, name VARCHAR)"
	)
	memory_db.execute("INSERT INTO mem_test VALUES (1, 'Memory Test')")

	file_db.execute(
	"CREATE TABLE IF NOT EXISTS file_test (id INTEGER, name VARCHAR)"
	)
	file_db.execute("INSERT INTO file_test VALUES (1, 'File Test')")
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	## Testing Connection Persistence

	Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.

	1. First, we'll query our tables to confirm the data was properly inserted
	2. Then, we'll simulate an application restart by creating new connections
	3. Finally, we'll check which data persists after the "restart"
	"""
	)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(r"""## Current Database Contents""")
	return


	@app.cell
	def _(mem_test, memory_db, mo):
	_df = mo.sql(
	f"""
	SELECT * FROM mem_test
	""",
	engine=memory_db
	)
	return


	@app.cell
	def _(file_db, file_test, mo):
	_df = mo.sql(
	f"""
	SELECT * FROM file_test
	""",
	engine=file_db
	)
	return


	@app.cell
	def _():
	# We don't actually close the connections here since we need them for later cells
	# Just a placeholder for the concept
	return


	@app.cell(hide_code=True)
	def _file_query(mo):
	mo.md(rf"""## 🔄 Simulating Application Restart...""")
	return


	@app.cell
	def _(duckdb):
	# Create new connections (simulating restart)
	new_memory_db = duckdb.connect(":memory:")
	new_file_db = duckdb.connect("example.db")
	return new_file_db, new_memory_db


	@app.cell
	def _(new_memory_db):
	# Try to query tables in the new memory connection
	try:
	new_memory_db.execute("SELECT * FROM mem_test").df()
	memory_persistence = "✅ Data persisted in memory (unexpected)"
	memory_data_available = True
	except Exception as e:
	memory_persistence = "❌ Data lost from memory (expected behavior)"
	memory_data_available = False
	return memory_data_available, memory_persistence


	@app.cell
	def _(new_file_db):
	# Try to query tables in the new file connection
	try:
	file_data = new_file_db.execute("SELECT * FROM file_test").df()
	file_persistence = "✅ Data persisted in file (expected behavior)"
	file_data_available = True
	except Exception as e:
	file_persistence = "❌ Data lost from file (unexpected)"
	file_data_available = False
	file_data = None
	return file_data, file_data_available, file_persistence


	@app.cell
	def _(
	file_data_available,
	file_persistence,
	memory_data_available,
	memory_persistence,
	mo,
	):
	# Create an interactive display to show persistence results
	persistence_results = mo.ui.table(
	{
	"Connection Type": ["In-Memory Database", "File-Based Database"],
	"Persistence Status": [memory_persistence, file_persistence],
	"Data Available After Restart": [
	memory_data_available,
	file_data_available,
	],
	}
	)

	mo.md("### Persistence Test Results")
	return (persistence_results,)


	@app.cell
	def _(persistence_results):
	persistence_results
	return


	@app.cell
	def _(file_data, file_data_available, mo):
	if file_data_available:
	mo.md("### Persisted File-Based Data:")
	mo.ui.table(file_data)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	# [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)

	DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.

	## Table Creation Options

	DuckDB supports various table creation options, including:

	- Basic tables with column definitions
	- Temporary tables that exist only during the session
	- CREATE OR REPLACE to recreate tables
	- Primary keys and other constraints
	- Various data types including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
	"""
	)
	return


	@app.cell
	def _create_users_tables(file_db, new_memory_db):
	# For the memory database
	try:
	new_memory_db.execute("DROP TABLE IF EXISTS users_memory")
	except:
	pass

	# For the file database
	try:
	file_db.execute("DROP TABLE IF EXISTS users_file")
	except:
	pass
	return


	@app.cell
	def _(file_db, new_memory_db):
	# Create advanced users table in memory database with primary key
	new_memory_db.execute("""
	CREATE TABLE users_memory (
	id INTEGER PRIMARY KEY,
	name VARCHAR NOT NULL,
	age INTEGER CHECK (age > 0),
	email VARCHAR UNIQUE,
	registration_date DATE DEFAULT CURRENT_DATE,
	last_login TIMESTAMP,
	account_balance DECIMAL(10,2) DEFAULT 0.00
	)
	""")

	# Create users table in file database
	file_db.execute("""
	CREATE TABLE users_file (
	id INTEGER PRIMARY KEY,
	name VARCHAR NOT NULL,
	age INTEGER CHECK (age > 0),
	email VARCHAR UNIQUE,
	registration_date DATE DEFAULT CURRENT_DATE,
	last_login TIMESTAMP,
	account_balance DECIMAL(10,2) DEFAULT 0.00
	)
	""")
	return


	@app.cell
	def _(mo, new_memory_db):
	# Get table schema information using DuckDB's internal system tables
	memory_schema = new_memory_db.execute("""
	SELECT column_name, data_type, is_nullable
	FROM information_schema.columns
	WHERE table_name = 'users_memory'
	ORDER BY ordinal_position
	""").df()

	# Display the schema using marimo's UI components
	mo.md("### 🔍 Table Schema Information")
	return (memory_schema,)


	@app.cell(hide_code=True)
	def _(memory_schema, mo):
	mo.ui.table(memory_schema)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	# [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)

	DuckDB supports multiple ways to insert data:

	1. INSERT INTO VALUES: Insert specific values
	2. INSERT INTO SELECT: Insert data from query results
	3. Parameterized inserts: Using prepared statements
	4. Bulk inserts: For efficient loading of multiple rows

	Let's demonstrate these different insertion methods:
	"""
	)
	return


	@app.cell
	def _insert_user_data(date):
	today = date.today()


	# First check if records already exist to avoid duplicate key errors
	def safe_insert(connection, table_name, data):
	"""
	Safely insert data into a table by checking for existing IDs first
	"""
	# Check which IDs already exist in the table
	existing_ids = (
	connection.execute(f"SELECT id FROM {table_name}")
	.fetchdf()["id"]
	.tolist()
	)

	# Filter out data with IDs that already exist
	new_data = [record for record in data if record[0] not in existing_ids]

	if not new_data:
	print(
	f"No new records to insert into {table_name}. All IDs already exist."
	)
	return 0

	# Prepare the placeholders for the SQL statement
	placeholders = ", ".join(
	["(" + ", ".join(["?"] * len(new_data[0])) + ")"] * len(new_data)
	)

	# Flatten the list of tuples for parameter binding
	flat_data = [item for sublist in new_data for item in sublist]

	# Perform the insertion
	if flat_data:
	columns = "(id, name, age, email, registration_date, last_login, account_balance)"
	connection.execute(
	f"INSERT INTO {table_name} {columns} VALUES {placeholders}",
	flat_data,
	)
	return len(new_data)
	return 0
	return (safe_insert,)


	@app.cell
	def _():
	# Prepare the data
	user_data = [
	(
	1,
	"Alice",
	25,
	"alice@example.com",
	"2021-01-01",
	"2023-01-15 14:30:00",
	1250.75,
	),
	(
	2,
	"Bob",
	30,
	"bob@example.com",
	"2021-02-01",
	"2023-02-10 09:15:22",
	750.50,
	),
	(
	3,
	"Charlie",
	35,
	"charlie@example.com",
	"2021-03-01",
	"2023-03-05 17:45:10",
	3200.25,
	),
	(
	4,
	"David",
	40,
	"david@example.com",
	"2021-04-01",
	"2023-04-20 10:30:45",
	1800.00,
	),
	(
	5,
	"Emma",
	45,
	"emma@example.com",
	"2021-05-01",
	"2023-05-12 11:20:30",
	2500.00,
	),
	(
	6,
	"Frank",
	50,
	"frank@example.com",
	"2021-06-01",
	"2023-06-18 16:10:15",
	900.25,
	),
	]
	return (user_data,)


	@app.cell
	def _(mo, new_memory_db, safe_insert, user_data):
	# Safely insert data into memory database
	records_inserted = safe_insert(new_memory_db, "users_memory", user_data)
	mo.md(
	f"""
	Inserted {records_inserted} new records into users_memory.
	"""
	)
	return


	@app.cell
	def _(file_db, safe_insert, user_data):
	def _():
	# Safely insert data into file database
	records_inserted = safe_insert(file_db, "users_file", user_data)
	return print(f"Inserted {records_inserted} new records into users_file")


	_()
	return


	@app.cell
	def _():
	# If you need to add just one record, you can use a similar approach:
	new_user = (
	7,
	"Grace",
	28,
	"grace@example.com",
	"2021-07-01",
	"2023-07-22 13:45:10",
	1675.50,
	)
	return (new_user,)


	@app.cell
	def _(new_memory_db, new_user):
	# Check if the ID exists before inserting
	if not new_memory_db.execute(
	"SELECT id FROM users_memory WHERE id = ?", [new_user[0]]
	).fetchone():
	new_memory_db.execute(
	"""
	INSERT INTO users_memory (id, name, age, email, registration_date, last_login, account_balance)
	VALUES (?, ?, ?, ?, ?, ?, ?)
	""",
	new_user,
	)
	print(f"Added user {new_user[1]} to users_memory")
	else:
	print(f"User with ID {new_user[0]} already exists in users_memory")
	return


	@app.cell
	def _(file_db, new_user):
	# Do the same for the file database
	if not file_db.execute(
	"SELECT id FROM users_file WHERE id = ?", [new_user[0]]
	).fetchone():
	file_db.execute(
	"""
	INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
	VALUES (?, ?, ?, ?, ?, ?, ?)
	""",
	new_user,
	)
	print(f"Added user {new_user[1]} to users_file")
	else:
	print(f"User with ID {new_user[0]} already exists in users_file")
	return


	@app.cell
	def _(new_memory_db):
	# First try to update
	cursor = new_memory_db.execute(
	"""
	UPDATE users_memory
	SET name = ?, age = ?, email = ?,
	registration_date = ?, last_login = ?, account_balance = ?
	WHERE id = ?
	""",
	(
	"Henry",
	33,
	"henry@example.com",
	"2021-08-01",
	"2023-08-05 09:10:15",
	3100.75,
	8, # ID should be the last parameter
	),
	)
	return (cursor,)


	@app.cell
	def _(cursor, mo, new_memory_db):
	# If no rows were updated, perform an insert
	if cursor.rowcount == 0:
	new_memory_db.execute(
	"""
	INSERT INTO users_memory
	(id, name, age, email, registration_date, last_login, account_balance)
	VALUES (?, ?, ?, ?, ?, ?, ?)
	""",
	(
	8,
	"Henry",
	33,
	"henry@example.com",
	"2021-08-01",
	"2023-08-05 09:10:15",
	3100.75,
	),
	)

	mo.md(
	f"""
	Upserted Henry into users_memory.
	"""
	)
	return


	@app.cell
	def _(file_db, mo):
	# For DuckDB using ON CONFLICT, we need to specify the conflict target column
	file_db.execute(
	"""
	INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
	VALUES (?, ?, ?, ?, ?, ?, ?)
	ON CONFLICT (id) DO UPDATE SET
	name = EXCLUDED.name,
	age = EXCLUDED.age,
	email = EXCLUDED.email,
	registration_date = EXCLUDED.registration_date,
	last_login = EXCLUDED.last_login,
	account_balance = EXCLUDED.account_balance
	""",
	(
	8,
	"Henry",
	33,
	"henry@example.com",
	"2021-08-01",
	"2023-08-05 09:10:15",
	3100.75,
	),
	)

	mo.md(
	f"""
	Upserted Henry into users_file.
	"""
	)
	return


	@app.cell
	def _view_tables_after_insert(new_memory_db):
	# Display memory data using DuckDB's query capabilities
	memory_results = new_memory_db.execute("""
	SELECT
	id,
	name,
	age,
	email,
	registration_date,
	last_login,
	account_balance
	FROM users_memory
	ORDER BY id
	""").df()
	return (memory_results,)


	@app.cell
	def _(file_db):
	# Display file data with formatting
	file_results = file_db.execute("""
	SELECT
	id,
	name,
	age,
	email,
	registration_date,
	last_login,
	CAST(account_balance AS DECIMAL(10,2)) AS account_balance
	FROM users_file
	ORDER BY id
	""").df()
	return (file_results,)


	@app.cell
	def _(mo):
	mo.md(
	r"""
	<!-- Create an interactive display with tabs using marimo components -->
	## 📊 Database Contents After Insertion
	"""
	)
	return


	@app.cell(hide_code=True)
	def _(file_results, memory_results, mo):
	tabs = mo.ui.tabs(
	{
	"In-Memory Database": mo.ui.table(memory_results),
	"File-Based Database": mo.ui.table(file_results),
	}
	)
	tabs
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	# [4. Using SQL Directly in Marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)

	There are multiple ways to leverage DuckDB's SQL capabilities in marimo:

	1. Direct execution: Using DuckDB connections to execute SQL
	2. Marimo SQL: Using Marimo's built-in SQL engine
	3. Interactive queries: Combining UI elements with SQL execution

	Let's explore these approaches:
	"""
	)
	return


	@app.cell(hide_code=True)
	def _sql_with_marimo(mo):
	mo.md(
	rf"""
	<!-- Using Marimo's SQL engine with direct SQL on memory_results DataFrame -->
	## 🔍 Query with Marimo SQL
	"""
	)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	rf"""
	## Marimo has its own built-in SQL engine that can work with DataFrames.
	Let's use it to filter our users:
	"""
	)
	return


	@app.cell
	def _(mo):
	# Create a SQL selector for users with age threshold
	age_threshold = mo.ui.slider(25, 50, value=30, label="Minimum Age")
	return (age_threshold,)


	@app.cell
	def _(age_threshold, memory_results, mo):
	# Create a function to filter users based on the slider value
	def filtered_users():
	# Use DuckDB directly instead of mo.sql with users param
	filtered_df = memory_results[memory_results["age"] >= age_threshold.value]
	filtered_df = filtered_df.sort_values("age")
	return mo.ui.table(filtered_df)
	return (filtered_users,)


	@app.cell
	def _(age_threshold, filtered_users, mo):
	layout = mo.vstack(
	[
	mo.md("### Select minimum age:"),
	age_threshold,
	mo.md("### Users meeting age criteria:"),
	filtered_users(),
	],
	gap=1.5,
	)
	layout
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""")
	return


	@app.cell
	def _polars_integration(pl):
	# Create a Polars DataFrame
	polars_df = pl.DataFrame(
	{
	"id": [101, 102, 103],
	"name": ["Product A", "Product B", "Product C"],
	"price": [29.99, 49.99, 19.99],
	"category": ["Electronics", "Furniture", "Books"],
	}
	)
	return (polars_df,)


	@app.cell
	def _(mo):
	mo.md(
	rf"""
	<!-- Display the Polars DataFrame -->
	## Original Polars DataFrame:
	"""
	)
	return


	@app.cell
	def _(mo, polars_df):
	mo.ui.table(polars_df)
	return


	@app.cell
	def _(new_memory_db, polars_df):
	# Register the Polars DataFrame as a DuckDB table in memory connection
	new_memory_db.register("products_polars", polars_df)

	# Query the registered table
	polars_query_result = new_memory_db.execute(
	"SELECT * FROM products_polars WHERE price > 25"
	).df()
	return (polars_query_result,)


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	<!-- Display the query result -->
	## DuckDB Query Result (From Polars Data):
	"""
	)
	return


	@app.cell
	def _(mo, polars_query_result):
	mo.ui.table(polars_query_result)
	return


	@app.cell
	def _(mo, new_memory_db):
	# Demonstrate a more complex query
	complex_query_result = new_memory_db.execute("""
	SELECT
	category,
	COUNT(*) as product_count,
	AVG(price) as avg_price,
	MIN(price) as min_price,
	MAX(price) as max_price
	FROM products_polars
	GROUP BY category
	ORDER BY avg_price DESC
	""").df()

	mo.md("## Aggregated Product Data by Category:")
	return (complex_query_result,)


	@app.cell
	def _(complex_query_result, mo):
	mo.ui.table(complex_query_result)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""")
	return


	@app.cell
	def _join_operations(new_memory_db):
	# Create another table to join with
	new_memory_db.execute("""
	CREATE TABLE IF NOT EXISTS departments (
	id INTEGER,
	department_name VARCHAR,
	manager_id INTEGER
	)
	""")
	return


	@app.cell
	def _(new_memory_db):
	new_memory_db.execute("""
	INSERT INTO departments VALUES
	(101, 'Engineering', 1),
	(102, 'Marketing', 2),
	(103, 'Finance', NULL)
	""")
	return


	@app.cell
	def _(new_memory_db):
	# Execute a join query
	join_result = new_memory_db.execute("""
	SELECT
	u.id,
	u.name,
	u.age,
	d.department_name
	FROM users_memory u
	LEFT JOIN departments d ON u.id = d.manager_id
	ORDER BY u.id
	""").df()
	return (join_result,)


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	rf"""
	<!-- Display the join result -->
	## Join Result (Users and Departments):
	"""
	)
	return


	@app.cell
	def _(join_result, mo):
	mo.ui.table(join_result)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	rf"""
	<!-- Demonstrate different types of joins -->
	## Different Types of Joins
	"""
	)
	return


	@app.cell
	def _(new_memory_db):
	# Inner join
	inner_join = new_memory_db.execute("""
	SELECT u.id, u.name, d.department_name
	FROM users_memory u
	INNER JOIN departments d ON u.id = d.manager_id
	""").df()

	# Right join
	right_join = new_memory_db.execute("""
	SELECT u.id, u.name, d.department_name
	FROM users_memory u
	RIGHT JOIN departments d ON u.id = d.manager_id
	""").df()

	# Full outer join
	full_join = new_memory_db.execute("""
	SELECT u.id, u.name, d.department_name
	FROM users_memory u
	FULL OUTER JOIN departments d ON u.id = d.manager_id
	""").df()
	return full_join, inner_join, right_join


	@app.cell
	def _(full_join, inner_join, join_result, mo, right_join):
	join_tabs = mo.ui.tabs(
	{
	"Left Join": mo.ui.table(join_result),
	"Inner Join": mo.ui.table(inner_join),
	"Right Join": mo.ui.table(right_join),
	"Full Outer Join": mo.ui.table(full_join),
	}
	)

	join_tabs
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""")
	return


	@app.cell
	def _aggregate_operations(new_memory_db):
	# Execute an aggregate query
	agg_result = new_memory_db.execute("""
	SELECT
	AVG(age) as avg_age,
	MAX(age) as max_age,
	MIN(age) as min_age,
	COUNT(*) as total_users,
	SUM(account_balance) as total_balance
	FROM users_memory
	""").df()
	return (agg_result,)


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	rf"""
	<!-- Display the aggregate result -->
	## Aggregate Results (All Users):
	"""
	)
	return


	@app.cell
	def _(agg_result, mo):
	mo.ui.table(agg_result)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	rf"""
	<!-- More complex aggregate query with grouping -->
	## Aggregate Results (Grouped by Age Range):
	"""
	)
	return


	@app.cell
	def _(new_memory_db):
	age_groups = new_memory_db.execute("""
	SELECT
	CASE
	WHEN age < 30 THEN 'Under 30'
	WHEN age BETWEEN 30 AND 40 THEN '30 to 40'
	ELSE 'Over 40'
	END as age_group,
	COUNT(*) as count,
	AVG(age) as avg_age,
	AVG(account_balance) as avg_balance
	FROM users_memory
	GROUP BY 1
	ORDER BY 1
	""").df()
	return (age_groups,)


	@app.cell
	def _(age_groups, mo):
	mo.ui.table(age_groups)
	return


	@app.cell
	def _(mo):
	mo.md(
	r"""
	<!-- Window functions demo -->
	### Window Functions Example:
	"""
	)
	return


	@app.cell
	def _(mo, new_memory_db):
	window_result = new_memory_db.execute("""
	SELECT
	id,
	name,
	age,
	account_balance,
	RANK() OVER (ORDER BY account_balance DESC) as balance_rank,
	account_balance - AVG(account_balance) OVER () as diff_from_avg,
	account_balance / SUM(account_balance) OVER () * 100 as pct_of_total
	FROM users_memory
	ORDER BY balance_rank
	""").df()

	mo.ui.table(window_result)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""")
	return


	@app.cell
	def _convert_results(new_memory_db):
	polars_result = new_memory_db.execute(
	"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
	).pl()
	return (polars_result,)


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	<!-- Display the converted results -->
	## Query Result as Polars DataFrame:
	"""
	)
	return


	@app.cell
	def _(mo, polars_result):
	mo.ui.table(polars_result)
	return


	@app.cell
	def _(new_memory_db):
	pandas_result = new_memory_db.execute(
	"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
	).fetch_df()
	return (pandas_result,)


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(r"""## Same Query Result as Pandas DataFrame:""")
	return


	@app.cell
	def _(mo, pandas_result):
	mo.ui.table(pandas_result)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	<!-- Demonstrate the differences in handling -->
	## Differences in DataFrame Handling
	"""
	)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	<!-- Polars operation -->
	## Polars: Filter users over 35 and calculate average balance
	"""
	)
	return


	@app.cell
	def _(mo, pl, polars_result):
	def _():
	polars_filtered = polars_result.filter(pl.col("age") > 35)
	polars_avg = polars_filtered.select(
	pl.col("account_balance").mean().alias("avg_balance")
	)

	layout = mo.vstack(
	[
	mo.md("### Filtered Polars DataFrame (Age > 35):"),
	mo.ui.table(polars_filtered),
	mo.md("### Average Account Balance:"),
	mo.ui.table(polars_avg),
	],
	gap=1.5,
	)
	return layout


	_()
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	<!-- Pandas equivalent (using pandas style) -->
	## Pandas: Same operation in pandas style
	"""
	)
	return


	@app.cell
	def _(mo, pandas_result):
	pandas_avg = pandas_result[pandas_result["age"] > 35]["account_balance"].mean()
	mo.md(f"Average balance: {pandas_avg:.2f}")
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md("""## 9. Data Visualization with DuckDB and Plotly""")
	return


	@app.cell
	def _(age_groups, mo, new_memory_db, plotly_express):
	# User distribution by age group
	fig1 = plotly_express.bar(
	age_groups,
	x="age_group",
	y="count",
	title="User Distribution by Age Group",
	labels={"count": "Number of Users", "age_group": "Age Group"},
	color="age_group",
	color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
	)
	fig1.update_traces(
	text=age_groups["count"],
	textposition="outside",
	)
	fig1.update_layout(height=450, margin=dict(t=50, b=50))


	# Average balance by age group
	fig2 = plotly_express.bar(
	age_groups,
	x="age_group",
	y="avg_balance",
	title="Average Account Balance by Age Group",
	labels={"avg_balance": "Average Balance ($)", "age_group": "Age Group"},
	color="age_group",
	color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
	)
	fig2.update_traces(
	text=[f"${val:.2f}" for val in age_groups["avg_balance"]],
	textposition="outside",
	)
	fig2.update_layout(height=450, margin=dict(t=50, b=50))


	# Age vs Account Balance scatter plot
	scatter_data = new_memory_db.execute(
	"""
	SELECT
	name,
	age,
	account_balance
	FROM users_memory
	ORDER BY age
	"""
	).df()

	fig3 = plotly_express.scatter(
	scatter_data,
	x="age",
	y="account_balance",
	title="Age vs. Account Balance",
	labels={"account_balance": "Account Balance ($)", "age": "Age"},
	color_discrete_sequence=["#FF7F0E"],
	trendline="ols",
	hover_data=["age", "account_balance"],
	size_max=15,
	)
	fig3.update_traces(marker=dict(size=12))
	fig3.update_layout(height=450, margin=dict(t=50, b=50))


	# Distribution of account balances
	balance_data = new_memory_db.execute(
	"""
	SELECT
	name,
	account_balance
	FROM users_memory
	ORDER BY account_balance DESC
	"""
	).df()

	fig4 = plotly_express.pie(
	balance_data,
	names="name",
	values="account_balance",
	title="Distribution of Account Balances",
	labels={"account_balance": "Account Balance ($)", "name": "User"},
	color_discrete_sequence=plotly_express.colors.qualitative.Pastel,
	)
	fig4.update_traces(textinfo="percent+label", textposition="inside")
	fig4.update_layout(height=450, margin=dict(t=50, b=50))


	category_tabs = mo.ui.tabs(
	{
	"Age Group Analysis": mo.vstack(
	[
	mo.ui.tabs(
	{
	"User Distribution": mo.ui.plotly(fig1),
	"Average Balance": mo.ui.plotly(fig2),
	}
	)
	]
	),
	"Financial Analysis": mo.vstack(
	[
	mo.ui.tabs(
	{
	"Age vs Balance": mo.ui.plotly(fig3),
	"Balance Distribution": mo.ui.plotly(fig4),
	}
	)
	]
	),
	},
	lazy=True,
	)

	mo.vstack(
	[
	mo.md("### Select a visualization category:"),
	category_tabs,
	],
	gap=1.5,
	)
	return


	@app.cell(hide_code=True)
	def _(mo):
	mo.md(
	r"""
	# [9. Database Management Best Practices]

	### Closing Connections

	It's important to close database connections when you're done with them, especially for file-based connections:

	```python
	memory_db.close()
	file_db.close()
	```

	### Transaction Management

	DuckDB supports transactions, which can be useful for more complex operations:

	```python
	conn = duckdb.connect('mydb.db')
	conn.begin() # Start transaction

	try:
	conn.execute("INSERT INTO users VALUES (1, 'Test User')")
	conn.execute("UPDATE balances SET amount = amount - 100 WHERE user_id = 1")
	conn.commit() # Commit changes
	except:
	conn.rollback() # Undo changes if error
	raise
	```

	### Query Performance

	DuckDB is optimized for analytical queries. For best performance:

	- Use appropriate data types
	- Create indexes for frequently queried columns
	- For large datasets, consider partitioning
	- Use prepared statements for repeated queries
	"""
	)
	return


	@app.cell(hide_code=True)
	def _interactive_dashboard(mo):
	mo.md(rf"""## 10. Interactive DuckDB Dashboard with Marimo and Plotly""")
	return


	@app.cell
	def _(mo):
	# Create an interactive filter for age range
	min_age = mo.ui.slider(20, 50, value=25, label="Minimum Age")
	max_age = mo.ui.slider(20, 50, value=50, label="Maximum Age")
	return max_age, min_age


	@app.cell
	def _(max_age, min_age, new_memory_db):
	# Create a function to filter data and update visualizations
	def get_filtered_data(min_val=min_age.value, max_val=max_age.value):
	# Get filtered data based on slider values using parameterized query for safety
	return new_memory_db.execute(
	"""
	SELECT
	id,
	name,
	age,
	email,
	account_balance,
	registration_date
	FROM users_memory
	WHERE age >= ? AND age <= ?
	ORDER BY age
	""",
	[min_val, max_val],
	).df()
	return (get_filtered_data,)


	@app.cell
	def _(get_filtered_data):
	def get_metrics(data=get_filtered_data()):
	return {
	"user count": len(data),
	"avg_balance": data["account_balance"].mean() if len(data) > 0 else 0,
	"total_balance": data["account_balance"].sum() if len(data) > 0 else 0,
	}
	return (get_metrics,)


	@app.cell
	def _(get_metrics, mo):
	def metrics_display(metrics=get_metrics()):
	return mo.hstack(
	[
	mo.vstack(
	[
	mo.md("### Selected Users"),
	mo.md(f"## {metrics['user count']}"),
	],
	align="center",
	),
	mo.vstack(
	[
	mo.md("### Average Balance"),
	mo.md(f"## ${metrics['avg_balance']:.2f}"),
	],
	align="center",
	),
	mo.vstack(
	[
	mo.md("### Total Balance"),
	mo.md(f"## ${metrics['total_balance']:.2f}"),
	],
	align="center",
	),
	],
	justify="space-between",
	gap=1.5,
	)
	return (metrics_display,)


	@app.cell
	def _(get_filtered_data, max_age, min_age, mo, plotly_express):
	def create_visualization(
	data=get_filtered_data(), min_val=min_age.value, max_val=max_age.value
	):
	if len(data) == 0:
	return mo.ui.text("No data available for the selected age range.")

	# Create visualizations for filtered data
	fig1 = plotly_express.bar(
	data,
	x="name",
	y="account_balance",
	title=f"Account Balance by User (Age {min_val} - {max_val})",
	labels={"account_balance": "Account Balance ($)", "name": "User"},
	color="account_balance",
	color_continuous_scale=plotly_express.colors.sequential.Plasma,
	text_auto=".2s",
	)
	fig1.update_layout(
	height=400,
	xaxis_tickangle=-45,
	margin=dict(t=50, b=70, l=50, r=30),
	)
	fig1.update_traces(
	textposition="outside",
	)

	fig2 = plotly_express.histogram(
	data,
	x="age",
	nbins=min(10, len(set(data["age"]))),
	title=f"Age Distribution (Age {min_val} - {max_val})",
	color_discrete_sequence=["#4C78A8"],
	opacity=0.8,
	histnorm="probability density",
	)
	fig2.update_layout(
	height=400,
	margin=dict(t=50, b=70, l=50, r=30),
	bargap=0.1,
	)

	fig3 = plotly_express.scatter(
	data,
	x="age",
	y="account_balance",
	title=f"Age vs. Account Balance (Age {min_val} - {max_val})",
	labels={"account_balance": "Account Balance ($)", "age": "Age"},
	color="age",
	color_continuous_scale="Viridis",
	size_max=25,
	size="account_balance",
	hover_name="name",
	)
	fig3.update_layout(
	height=400,
	margin=dict(t=50, b=70, l=50, r=30),
	)

	return mo.ui.tabs(
	{
	"Account Balance by User": mo.ui.plotly(fig1),
	"Age Distribution": mo.ui.plotly(fig2),
	"Age vs. Account Balance": mo.ui.plotly(fig3),
	}
	)
	return (create_visualization,)


	@app.cell
	def _(
	create_visualization,
	get_filtered_data,
	max_age,
	metrics_display,
	min_age,
	mo,
	):
	def dashboard(
	min_val=min_age.value,
	max_val=max_age.value,
	metrics=metrics_display(),
	data=get_filtered_data(),
	visualization=create_visualization()
	):
	return mo.vstack(
	[
	mo.md(f"### Interactive Dashboard (Age {min_val} - {max_val})"),
	metrics,
	mo.md("### Data Table"),
	mo.ui.table(data, page_size=5),
	mo.md("### Visualizations"),
	visualization,
	],
	gap=2
	)
	dashboard()
	return


	@app.cell(hide_code=True)
	def _conclusion(mo):
	mo.md(
	rf"""
	# Summary and Key Takeaways

	In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:

	1. Connection types: We learned the difference between in-memory databases (temporary) and file-based databases (persistent).

	2. Table creation: We created tables with various data types, constraints, and primary keys.

	3. Data insertion: We demonstrated different ways to insert data, including single inserts and bulk loading.

	4. SQL queries: We executed various SQL queries directly and through Marimo's UI components.

	5. Integration with Polars: We showed how DuckDB can work seamlessly with Polars DataFrames.

	6. Joins and relationships: We performed JOIN operations between tables to combine related data.

	7. Aggregation: We used aggregate functions to summarize and analyze data.

	8. Data conversion: We converted DuckDB results to both Polars and Pandas DataFrames.

	9. Best practices: We reviewed best practices for managing DuckDB connections and transactions.

	10. Visualization: We created interactive visualizations and dashboards with Plotly and Marimo.

	DuckDB is an excellent tool for data analysis, especially for analytical workloads. Its in-process nature makes it fast and easy to use, while its SQL compatibility makes it accessible for anyone familiar with SQL databases.

	### Next Steps

	- Try loading larger datasets into DuckDB
	- Experiment with more complex queries and window functions
	- Use DuckDB's COPY functionality to import/export data from/to files
	- Create more advanced interactive dashboards with Marimo and Plotly
	"""
	)
	return


	@app.cell(hide_code=True)
	def _():
	import marimo as mo
	import duckdb
	import polars as pl
	import os
	from datetime import date
	import plotly.express as plotly_express
	import plotly.graph_objects as plotly_graph_objects
	import numpy as np
	return date, duckdb, mo, os, pl, plotly_express


	if __name__ == "__main__":
	app.run()