January 1, 2024
Core Concepts: Connecting to DuckDB and Executing SQL
This article picks up where our last one left off, continuing our exploration of DuckDB by diving into the essential concepts you need to start working effectively with this powerful analytical database.
Think of DuckDB as your personal data chef, ready to transform raw data into delicious insights. Before any culinary magic happens, you need to establish your connection and learn to communicate your instructions clearly. In this article, we’ll explore how to connect to DuckDB, execute SQL queries, and define your data structures—fundamental skills for everything you’ll want to do with DuckDB.
Establishing Database Connections
The first step in working with DuckDB is establishing a connection—think of it as plugging in your data pipeline. DuckDB offers two primary storage modes:
In-Memory Database
In-memory databases are perfect for temporary data processing and experimentation:
import duckdb
import pandas as pd
# Create an in-memory database connection
con = duckdb.connect(':memory:')
# Create a table and insert data
con.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob')")
# Query the data
df = con.sql("SELECT * FROM my_table").df()
print(df)
# Close the connection
con.close()
Any data created within this connection will be lost when con.close() is called—like a whiteboard that gets erased after a meeting.
Persistent Database
Persistent databases store data on disk for long-term analysis:
import duckdb
# Create a persistent database connection
con = duckdb.connect('my_database.duckdb')
# Create a table and insert data
con.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob')")
# Close the connection
con.close()
# Re-open the connection and query data
con = duckdb.connect('my_database.duckdb')
df = con.sql("SELECT * FROM my_table").df()
print(df)
con.close()
With persistent databases, data is stored on disk and remains accessible across sessions—more like a filing cabinet where you can store and retrieve documents as needed.
Choosing Between In-Memory and Persistent
| Feature | In-Memory | Persistent |
|---|---|---|
| Persistence | No | Yes |
| Performance | Faster | Slower |
| Storage | Limited by RAM | Limited by disk space |
| Use Cases | Temporary processing, testing | Long-term analysis, data pipelines |
Choose in-memory for quick, temporary work and persistent storage when you need to preserve your data between sessions.
Executing SQL Queries from Python
Once connected, you’ll need to execute SQL queries to interact with your data. DuckDB offers two primary methods for this:
The .sql() Method
The .sql() method is designed for executing queries where you intend to work with the results directly within Python:
import duckdb
with duckdb.connect(':memory:') as conn:
    # Create and populate a table
    conn.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR)")
    conn.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob')")
    # Execute a SELECT query
    result = conn.sql('SELECT * FROM my_table')
    # Convert to a Pandas DataFrame
    df = result.df()
    print(df)
When you run a query using .sql(), DuckDB returns a DuckDBPyRelation object—a view of the data that DuckDB knows how to compute. This approach defers actual data processing until you explicitly request it, making it efficient for complex operations.
The .execute() Method with Parameter Binding
While .sql() works well for general queries, the .execute() method excels when incorporating external data into your queries through parameter binding. This technique offers two crucial benefits:
1. Security: Parameter binding prevents SQL injection vulnerabilities.
2. Performance: It allows DuckDB to optimize the query plan more effectively.
Here’s how it works:
import duckdb
with duckdb.connect(':memory:') as conn:
    # Create a table
    conn.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR)")
    conn.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob')")
    # User input (simulated)
    user_id = 1
    # Execute a SELECT query with parameter binding
    conn.execute(
        'SELECT * FROM my_table WHERE id = ?',
        [user_id]
    )
    # Fetch the result
    result = conn.fetchdf()
    print(result)
In this example, the ? acts as a placeholder for the user_id value. The .execute() method safely substitutes the value, ensuring it's treated as data, not as part of the SQL command.
Parameter binding is essential when working with user input or dynamic values to prevent SQL injection attacks. Think of .execute() as filling out a form where each field has a specific purpose, preventing errors and security risks.
Basic Data Definition Language (DDL)
Data Definition Language (DDL) is like an architect’s blueprint for your database. It defines the structure before you start filling it with data.
Creating Tables
The CREATE TABLE statement defines the name of the table and the columns it will contain:
import duckdb
conn = duckdb.connect(':memory:')
# Create a customers table
conn.sql("""
CREATE TABLE customers (
customer_id INTEGER,
first_name VARCHAR,
last_name VARCHAR,
email VARCHAR,
signup_date DATE
);
""")
conn.close()
Choosing Data Types
Selecting the appropriate data type for each column is crucial for ensuring data integrity and optimizing storage:
| Data Type | Description |
|---|---|
| INTEGER | Whole numbers |
| BIGINT | Large whole numbers |
| VARCHAR(n) | Variable-length strings (max n chars) |
| TEXT | Variable-length strings (no max length) |
| DATE | Dates |
| TIMESTAMP | Dates and times |
| BOOLEAN | True/false values |
| DOUBLE | Floating-point numbers |
DuckDB also supports advanced data types such as UUID, JSON, STRUCT, LIST, and ENUM.
Here’s an example creating a table with different data types:
import duckdb
conn = duckdb.connect(':memory:')
conn.sql("""
CREATE TABLE products (
product_id INTEGER,
product_name VARCHAR(255),
description TEXT,
price DOUBLE,
is_available BOOLEAN,
manufacture_date DATE
);
""")
conn.close()
Creating Schemas and Adding Constraints
Schemas help organize your tables into logical groups:
conn.sql("CREATE SCHEMA IF NOT EXISTS my_schema;")
conn.sql("CREATE TABLE my_schema.customers (customer_id INTEGER);")
Constraints ensure data integrity:
conn.sql("""
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
product_name VARCHAR(255) NOT NULL,
price DOUBLE,
discounted_price DOUBLE GENERATED ALWAYS AS (price * 0.9)
);
""")
Here, PRIMARY KEY ensures each product has a unique ID, NOT NULL ensures every product has a name, and the generated column automatically calculates the discounted price.
Basic DuckDB API Usage
After executing a query, you’ll need to access the results. DuckDB provides several methods for this purpose.
Fetching Query Results
Using fetchall() to Retrieve All Rows
# Execute a query
conn.execute("SELECT * FROM products WHERE price > 30")
# Fetch all rows as a list of tuples
results = conn.fetchall()
# Process the results
for row in results:
    print(f"Product: {row[1]}, Price: ${row[2]}")
Using fetchdf() to Retrieve Results as a Pandas DataFrame
# Execute a query
conn.execute("SELECT * FROM products WHERE price > 30")
# Fetch results as a Pandas DataFrame
df = conn.fetchdf()
# Process the DataFrame
print(df)
print(f"Average price: ${df['price'].mean():.2f}")
Using arrow() to Retrieve Results as an Arrow Table
# Execute a query
conn.execute("SELECT * FROM products WHERE price > 30")
# Fetch results as an Arrow table
arrow_table = conn.arrow()
# Convert to a Pandas DataFrame if needed
df = arrow_table.to_pandas()
print(df)
Parameter Binding Options
DuckDB supports both positional and named parameter binding:
Positional Parameter Binding
user_id = 1
username = "alice"
email = "alice@example.com"
conn.execute(
"INSERT INTO users VALUES (?, ?, ?)",
[user_id, username, email]
)
Named Parameter Binding
user_data = {
"id": 1,
"username": "alice",
"email": "alice@example.com"
}
conn.execute(
"INSERT INTO users VALUES (:id, :username, :email)",
user_data
)
Error Handling and Transaction Management
Proper error handling is essential for building robust applications:
try:
    # Begin transaction
    conn.execute("BEGIN TRANSACTION")
    # Deduct from Alice's account
    conn.execute("""
        UPDATE accounts
        SET balance = balance - 200
        WHERE owner = 'Alice'
    """)
    # Add to Bob's account
    conn.execute("""
        UPDATE accounts
        SET balance = balance + 200
        WHERE owner = 'Bob'
    """)
    # Commit transaction
    conn.execute("COMMIT")
    print("Transaction committed successfully!")
except Exception as e:
    # Rollback transaction on error
    conn.execute("ROLLBACK")
    print(f"Transaction rolled back due to error: {e}")
Transactions ensure that multiple operations either all succeed or all fail, maintaining data integrity.
Summary
In this article, we’ve covered the foundational concepts for working with DuckDB:
1. Establishing connections - Choose between in-memory (temporary) and persistent (long-term) storage based on your needs.
2. Executing SQL queries - Use .sql() for general queries and .execute() with parameter binding for security and efficiency.
3. Defining data structures - Create tables with appropriate data types and constraints to ensure data integrity.
4. Fetching results - Retrieve query results using fetchall(), fetchdf(), or arrow() based on your requirements.
5. Handling errors - Implement robust error handling and transaction management to prevent data corruption.
These core concepts form the building blocks for everything else you’ll do with DuckDB. In our next article, we’ll dive deeper into the DuckDB Python API, exploring advanced query techniques and optimization strategies.
By mastering these fundamentals, you’re well on your way to using DuckDB’s full potential for fast, efficient data analysis. Happy querying!
That was some yummy duck! Do you fancy more Duck?
About the Author
Rick Hightower is a seasoned software architect and data engineering expert with over two decades of experience in building scalable data solutions. As a technical author and consultant, he specializes in data architecture, ETL processes, and modern data stack implementation.
With deep expertise in SQL, database optimization, and distributed systems, Rick has helped numerous organizations design and implement robust data architectures. He frequently writes about data engineering, focusing on practical applications of technologies like DuckDB, SQL optimization, and ETL processes.
Rick is passionate about sharing knowledge and making complex technical concepts accessible to practitioners at all levels. Through his articles and consulting work, he continues to contribute to the data engineering community’s growth and development.