Skip to the content.

SQL on HPC: Querying Large Databases

This guide walks you through running SQL queries on a large dataset using HPC. The dataset is too big for most laptops — that’s the point.

Dataset: NYC Yellow Taxi trips (2023) — ~20 million trips, ~9 GB total on disk

For instructors: This is a template. Adapt the queries to fit your course. The dataset and setup are ready to go.


Prerequisites


Step 1: Log into the HPC System

ssh your_username@vista.tacc.utexas.edu

Replace vista with your system. See the Jupyter on HPC guide if you need help with SSH.


Step 2: Clone the Repo and Set Up the Data

cd $WORK
git clone https://github.com/morehouse-supercomputing/sql-on-hpc.git
cd sql-on-hpc
bash scripts/setup_data.sh

This downloads 12 months of NYC taxi data (~3 GB), loads it into a SQLite database (~6 GB), and creates indexes. About ~9 GB total disk. Takes 10-15 minutes.

Instructor tip: Run this before class so students don’t wait. You can clone once to a shared directory, or have each student run it on their own $WORK.


Step 3: Open the Database

From the command line

sqlite3 data/nyc_taxi.db

You’re now in an interactive SQL shell. Try:

SELECT COUNT(*) FROM trips;

You should see ~20 million rows.

From a Jupyter notebook

If you’re using Jupyter on HPC (see the Jupyter guide):

import sqlite3
conn = sqlite3.connect('data/nyc_taxi.db')
cursor = conn.cursor()

cursor.execute("SELECT COUNT(*) FROM trips")
print(cursor.fetchone()[0])

The Database

Tables

trips — ~20 million rows

Column Type Description
pickup_time TEXT Pickup date and time
dropoff_time TEXT Dropoff date and time
passengers INTEGER Number of passengers
distance_miles REAL Trip distance
pickup_zone_id INTEGER Pickup location (zone ID)
dropoff_zone_id INTEGER Dropoff location (zone ID)
fare REAL Base fare amount
tip REAL Tip amount
total REAL Total charged
payment_type INTEGER 1=Credit card, 2=Cash, 3=No charge, 4=Dispute

zones — 70 rows (Manhattan zones with names)

Column Type Description
zone_id INTEGER Zone ID (matches pickup/dropoff zone IDs)
zone_name TEXT Neighborhood name

Indexes

Indexes exist on: pickup_time, pickup_zone_id, dropoff_zone_id, distance_miles, fare


Example Queries

Basic: Counting and Filtering

How many trips are in the database?

SELECT COUNT(*) FROM trips;

How many trips had more than 4 passengers?

SELECT COUNT(*) FROM trips WHERE passengers > 4;

What’s the average fare?

SELECT ROUND(AVG(fare), 2) AS avg_fare FROM trips;

What’s the longest trip by distance?

SELECT distance_miles, fare, total, pickup_time
FROM trips
ORDER BY distance_miles DESC
LIMIT 10;

Intermediate: Aggregation and Grouping

Average fare by passenger count:

SELECT passengers,
       COUNT(*) AS num_trips,
       ROUND(AVG(fare), 2) AS avg_fare,
       ROUND(AVG(tip), 2) AS avg_tip
FROM trips
WHERE passengers > 0
GROUP BY passengers
ORDER BY passengers;

Trips per month:

SELECT SUBSTR(pickup_time, 1, 7) AS month,
       COUNT(*) AS num_trips,
       ROUND(SUM(total), 2) AS total_revenue
FROM trips
GROUP BY month
ORDER BY month;

Busiest pickup zones (top 10):

SELECT pickup_zone_id,
       COUNT(*) AS num_trips
FROM trips
GROUP BY pickup_zone_id
ORDER BY num_trips DESC
LIMIT 10;

Advanced: Joins, Subqueries, and Window Functions

Busiest zones with names (JOIN):

SELECT z.zone_name,
       COUNT(*) AS num_trips,
       ROUND(AVG(t.fare), 2) AS avg_fare
FROM trips t
JOIN zones z ON t.pickup_zone_id = z.zone_id
GROUP BY z.zone_name
ORDER BY num_trips DESC
LIMIT 10;

Which zones have the highest average tips? (JOIN + GROUP BY + ORDER)

SELECT z.zone_name,
       COUNT(*) AS num_trips,
       ROUND(AVG(t.tip), 2) AS avg_tip,
       ROUND(AVG(t.fare), 2) AS avg_fare
FROM trips t
JOIN zones z ON t.pickup_zone_id = z.zone_id
GROUP BY z.zone_name
HAVING num_trips > 10000
ORDER BY avg_tip DESC
LIMIT 10;

Airport trips vs non-airport trips (subquery):

SELECT
    CASE
        WHEN pickup_zone_id IN (132, 138, 1) THEN 'Airport'
        ELSE 'Non-Airport'
    END AS trip_type,
    COUNT(*) AS num_trips,
    ROUND(AVG(fare), 2) AS avg_fare,
    ROUND(AVG(distance_miles), 2) AS avg_distance,
    ROUND(AVG(tip), 2) AS avg_tip
FROM trips
GROUP BY trip_type;

Credit card vs cash tipping behavior:

SELECT
    CASE payment_type
        WHEN 1 THEN 'Credit Card'
        WHEN 2 THEN 'Cash'
        WHEN 3 THEN 'No Charge'
        WHEN 4 THEN 'Dispute'
        ELSE 'Other'
    END AS payment_method,
    COUNT(*) AS num_trips,
    ROUND(AVG(tip), 2) AS avg_tip,
    ROUND(AVG(fare), 2) AS avg_fare,
    ROUND(AVG(tip) / NULLIF(AVG(fare), 0) * 100, 1) AS tip_pct
FROM trips
GROUP BY payment_type
ORDER BY num_trips DESC;

Hourly trip patterns (time extraction):

SELECT CAST(SUBSTR(pickup_time, 12, 2) AS INTEGER) AS hour_of_day,
       COUNT(*) AS num_trips,
       ROUND(AVG(fare), 2) AS avg_fare
FROM trips
WHERE pickup_time IS NOT NULL
GROUP BY hour_of_day
ORDER BY hour_of_day;

Why HPC?

This database has ~20 million rows and takes up ~9 GB on disk. Here’s why you’d want HPC for this:

  Your Laptop HPC Compute Node
RAM 8-16 GB 128-223 GB
Disk Limited SSD 1 TB+ on $WORK
Full table scan on 20M rows Slow, might swap to disk Fast, fits in memory
Multiple queries at once Bogs down Plenty of headroom

A query that scans all 20 million rows needs to load the database into memory. At ~9 GB, a laptop with 8 GB of RAM can’t hold it without swapping to disk. On HPC with 223 GB of RAM, 9 GB is a rounding error.

For instructors: The real teaching moment is when a student runs a query on their laptop and it takes 30 seconds, then runs the same query on HPC and it takes 2 seconds. That’s the “aha.”


Tips for Instructors

Before class:

Adapting for your course:

Assessment ideas:


Troubleshooting

“sqlite3: command not found”

“setup_data.sh is taking forever”

“database is locked”

“disk quota exceeded”


Quick Reference

Task Command
Open database sqlite3 data/nyc_taxi.db
Count rows SELECT COUNT(*) FROM trips;
Show tables .tables
Show columns .schema trips
Pretty output .mode column then .headers on
Export to CSV .mode csv then .output results.csv then run query
Exit SQLite .quit