January 1, 2024
Core Concepts: Connecting to DuckDB and Executing SQL
This article picks up where our last one left off, continuing our exploration of DuckDB by diving into the essential concepts you need to start working effectively with this powerful analytical database.
Think of DuckDB as your personal data chef, ready to transform raw data into delicious insights. Before any culinary magic happens, you need to establish your connection and learn to communicate your instructions clearly. In this article, we’ll explore how to connect to DuckDB, execute SQL queries, and define your data structures—fundamental skills for everything you’ll want to do with DuckDB.
Establishing Database Connections
The first step in working with DuckDB is establishing a connection—think of it as plugging in your data pipeline. DuckDB offers two primary storage modes:
In-Memory Database
In-memory databases are perfect for temporary data processing and experimentation:
import duckdb
import pandas as pd
# Create an in-memory database connection
con = duckdb.connect(':memory:')
# Create a table and insert data
con.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob')")
# Query the data
df = con.sql("SELECT * FROM my_table").df()
print(df)
# Close the connection
con.close()
Any data created within this connection will be lost when con.close() is called—like a whiteboard that gets erased after a meeting.
Persistent Database
Persistent databases store data on disk for long-term analysis:
import duckdb
# Create a persistent database connection
con = duckdb.connect('my_database.duckdb')
# Create a table and insert data
con.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob')")
# Close the connection
con.close()
# Re-open the connection and query data
con = duckdb.connect('my_database.duckdb')
df = con.sql("SELECT * FROM my_table").df()
print(df)
con.close()
With persistent databases, data is stored on disk and remains accessible across sessions—more like a filing cabinet where you can store and retrieve documents as needed.
Choosing Between In-Memory and Persistent
| Feature | In-Memory | Persistent |
|---|---|---|
| Persistence | No | Yes |
| Performance | Faster | Slower |
| Storage | Limited by RAM | Limited by disk space |
| Use Cases | Temporary processing, testing | Long-term analysis, data pipelines |
Choose in-memory for quick, temporary work and persistent storage when you need to preserve your data between sessions.
Executing SQL Queries from Python
Once connected, you’ll need to execute SQL queries to interact with your data. DuckDB offers two primary methods for this:
The .sql() Method
The .sql() method is designed for executing queries where you intend to work with the results directly within Python:
import duckdb
with duckdb.connect(':memory:') as conn:
    # Create and populate a table
    conn.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR)")
    conn.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob')")
    # Execute a SELECT query
    result = conn.sql('SELECT * FROM my_table')
    # Convert to a Pandas DataFrame
    df = result.df()
    print(df)
When you run a query using .sql(), DuckDB returns a DuckDBPyRelation object—a view of the data that DuckDB knows how to compute. This approach defers actual data processing until you explicitly request it, making it efficient for complex operations.
The .execute() Method with Parameter Binding
While .sql() works well for general queries, the .execute() method excels when incorporating external data into your queries through parameter binding. This technique offers two crucial benefits:
1. Security: Parameter binding prevents SQL injection vulnerabilities.
2. Performance: It allows DuckDB to optimize the query plan more effectively.
Here’s how it works:
import duckdb
with duckdb.connect(':memory:') as conn:
    # Create a table
    conn.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR)")
    conn.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob')")
    # User input (simulated)
    user_id = 1
    # Execute a SELECT query with parameter binding
    conn.execute(
        'SELECT * FROM my_table WHERE id = ?',
        [user_id]
    )
    # Fetch the result
    result = conn.fetchdf()
    print(result)
In this example, the ? acts as a placeholder for the user_id value. The .execute() method safely substitutes the value, ensuring it's treated as data, not as part of the SQL command.
Parameter binding is essential when working with user input or dynamic values to prevent SQL injection attacks. Think of .execute() as filling out a form where each field has a specific purpose, preventing errors and security risks.
Basic Data Definition Language (DDL)
Data Definition Language (DDL) is like an architect’s blueprint for your database. It defines the structure before you start filling it with data.
Creating Tables
The CREATE TABLE statement defines the name of the table and the columns it will contain:
import duckdb
conn = duckdb.connect(':memory:')
# Create a customers table
conn.sql("""
CREATE TABLE customers (
customer_id INTEGER,
first_name VARCHAR,
last_name VARCHAR,
email VARCHAR,
signup_date DATE
);
""")
conn.close()
Choosing Data Types
Selecting the appropriate data type for each column is crucial for ensuring data integrity and optimizing storage:
| Data Type | Description |
|---|---|
| INTEGER | Whole numbers |
| BIGINT | Large whole numbers |
| VARCHAR(n) | Variable-length strings (max n chars) |
| TEXT | Variable-length strings (no max length) |
| DATE | Dates |
| TIMESTAMP | Dates and times |
| BOOLEAN | True/false values |
| DOUBLE | Floating-point numbers |
DuckDB also supports advanced data types such as UUID, JSON, STRUCT, LIST, and ENUM.
Here’s an example creating a table with different data types:
import duckdb
conn = duckdb.connect(':memory:')
conn.sql("""
CREATE TABLE products (
product_id INTEGER,
product_name VARCHAR(255),
description TEXT,
price DOUBLE,
is_available BOOLEAN,
manufacture_date DATE
);
""")
conn.close()
Creating Schemas and Adding Constraints
Schemas help organize your tables into logical groups:
conn.sql("CREATE SCHEMA IF NOT EXISTS my_schema;")
conn.sql("CREATE TABLE my_schema.customers (customer_id INTEGER);")
Constraints ensure data integrity:
conn.sql("""
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
product_name VARCHAR(255) NOT NULL,
price DOUBLE,
discounted_price DOUBLE GENERATED ALWAYS AS (price * 0.9)
);
""")
Here, PRIMARY KEY ensures each product has a unique ID, NOT NULL ensures every product has a name, and the generated column automatically calculates the discounted price.
Basic DuckDB API Usage
After executing a query, you’ll need to access the results. DuckDB provides several methods for this purpose.
Fetching Query Results
Using fetchall() to Retrieve All Rows
# Execute a query
conn.execute("SELECT * FROM products WHERE price > 30")
# Fetch all rows as a list of tuples
results = conn.fetchall()
# Process the results
for row in results:
    print(f"Product: {row[1]}, Price: ${row[2]}")
Using fetchdf() to Retrieve Results as a Pandas DataFrame
# Execute a query
conn.execute("SELECT * FROM products WHERE price > 30")
# Fetch results as a Pandas DataFrame
df = conn.fetchdf()
# Process the DataFrame
print(df)
print(f"Average price: ${df['price'].mean():.2f}")
Using arrow() to Retrieve Results as an Arrow Table
# Execute a query
conn.execute("SELECT * FROM products WHERE price > 30")
# Fetch results as an Arrow table
arrow_table = conn.arrow()
# Convert to a Pandas DataFrame if needed
df = arrow_table.to_pandas()
print(df)
Parameter Binding Options
DuckDB supports both positional and named parameter binding:
Positional Parameter Binding
user_id = 1
username = "alice"
email = "alice@example.com"
conn.execute(
"INSERT INTO users VALUES (?, ?, ?)",
[user_id, username, email]
)
Named Parameter Binding
user_data = {
"id": 1,
"username": "alice",
"email": "alice@example.com"
}
conn.execute(
"INSERT INTO users VALUES (:id, :username, :email)",
user_data
)
Error Handling and Transaction Management
Proper error handling is essential for building robust applications:
try:
    # Begin transaction
    conn.execute("BEGIN TRANSACTION")
    # Deduct from Alice's account
    conn.execute("""
        UPDATE accounts
        SET balance = balance - 200
        WHERE owner = 'Alice'
    """)
    # Add to Bob's account
    conn.execute("""
        UPDATE accounts
        SET balance = balance + 200
        WHERE owner = 'Bob'
    """)
    # Commit transaction
    conn.execute("COMMIT")
    print("Transaction committed successfully!")
except Exception as e:
    # Rollback transaction on error
    conn.execute("ROLLBACK")
    print(f"Transaction rolled back due to error: {e}")
Transactions ensure that multiple operations either all succeed or all fail, maintaining data integrity.
Summary
In this article, we’ve covered the foundational concepts for working with DuckDB:
1. Establishing connections - Choose between in-memory (temporary) and persistent (long-term) storage based on your needs.
2. Executing SQL queries - Use .sql() for general queries and .execute() with parameter binding for security and efficiency.
3. Defining data structures - Create tables with appropriate data types and constraints to ensure data integrity.
4. Fetching results - Retrieve query results using fetchall(), fetchdf(), or arrow() based on your requirements.
5. Handling errors - Implement robust error handling and transaction management to prevent data corruption.
These core concepts form the building blocks for everything else you’ll do with DuckDB. In our next article, we’ll dive deeper into the DuckDB Python API, exploring advanced query techniques and optimization strategies.
By mastering these fundamentals, you’re well on your way to using DuckDB’s full potential for fast, efficient data analysis. Happy querying!
That was some yummy duck! Do you fancy more Duck?
About the Author
Rick Hightower is a seasoned software architect and data engineering expert with over two decades of experience in building scalable data solutions. As a technical author and consultant, he specializes in data architecture, ETL processes, and modern data stack implementation.
With deep expertise in SQL, database optimization, and distributed systems, Rick has helped numerous organizations design and implement robust data architectures. He frequently writes about data engineering, focusing on practical applications of technologies like DuckDB, SQL optimization, and ETL processes.
Rick is passionate about sharing knowledge and making complex technical concepts accessible to practitioners at all levels. Through his articles and consulting work, he continues to contribute to the data engineering community’s growth and development.