SQL on HPC: Querying Large Databases
This guide walks you through running SQL queries on a large dataset using HPC. The dataset is too big for most laptops — that’s the point.
Dataset: NYC Yellow Taxi trips (2023) — ~20 million trips, ~9 GB total on disk
For instructors: This is a template. Adapt the queries to fit your course. The dataset and setup are ready to go.
Prerequisites
- An HPC account with MFA set up (MSCF Getting Started guide)
- Access to an HPC system (e.g., Vista, Lonestar6, Stampede3)
- An active allocation
Step 1: Log into the HPC System
ssh your_username@vista.tacc.utexas.edu
Replace
vistawith your system. See the Jupyter on HPC guide if you need help with SSH.
Step 2: Clone the Repo and Set Up the Data
cd $WORK
git clone https://github.com/morehouse-supercomputing/sql-on-hpc.git
cd sql-on-hpc
bash scripts/setup_data.sh
This downloads 12 months of NYC taxi data (~3 GB), loads it into a SQLite database (~6 GB), and creates indexes. About ~9 GB total disk. Takes 10-15 minutes.
Instructor tip: Run this before class so students don’t wait. You can clone once to a shared directory, or have each student run it on their own
$WORK.
Step 3: Open the Database
From the command line
sqlite3 data/nyc_taxi.db
You’re now in an interactive SQL shell. Try:
SELECT COUNT(*) FROM trips;
You should see ~20 million rows.
From a Jupyter notebook
If you’re using Jupyter on HPC (see the Jupyter guide):
import sqlite3
conn = sqlite3.connect('data/nyc_taxi.db')
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM trips")
print(cursor.fetchone()[0])
The Database
Tables
trips — ~20 million rows
| Column | Type | Description |
|---|---|---|
| pickup_time | TEXT | Pickup date and time |
| dropoff_time | TEXT | Dropoff date and time |
| passengers | INTEGER | Number of passengers |
| distance_miles | REAL | Trip distance |
| pickup_zone_id | INTEGER | Pickup location (zone ID) |
| dropoff_zone_id | INTEGER | Dropoff location (zone ID) |
| fare | REAL | Base fare amount |
| tip | REAL | Tip amount |
| total | REAL | Total charged |
| payment_type | INTEGER | 1=Credit card, 2=Cash, 3=No charge, 4=Dispute |
zones — 70 rows (Manhattan zones with names)
| Column | Type | Description |
|---|---|---|
| zone_id | INTEGER | Zone ID (matches pickup/dropoff zone IDs) |
| zone_name | TEXT | Neighborhood name |
Indexes
Indexes exist on: pickup_time, pickup_zone_id, dropoff_zone_id, distance_miles, fare
Example Queries
Basic: Counting and Filtering
How many trips are in the database?
SELECT COUNT(*) FROM trips;
How many trips had more than 4 passengers?
SELECT COUNT(*) FROM trips WHERE passengers > 4;
What’s the average fare?
SELECT ROUND(AVG(fare), 2) AS avg_fare FROM trips;
What’s the longest trip by distance?
SELECT distance_miles, fare, total, pickup_time
FROM trips
ORDER BY distance_miles DESC
LIMIT 10;
Intermediate: Aggregation and Grouping
Average fare by passenger count:
SELECT passengers,
COUNT(*) AS num_trips,
ROUND(AVG(fare), 2) AS avg_fare,
ROUND(AVG(tip), 2) AS avg_tip
FROM trips
WHERE passengers > 0
GROUP BY passengers
ORDER BY passengers;
Trips per month:
SELECT SUBSTR(pickup_time, 1, 7) AS month,
COUNT(*) AS num_trips,
ROUND(SUM(total), 2) AS total_revenue
FROM trips
GROUP BY month
ORDER BY month;
Busiest pickup zones (top 10):
SELECT pickup_zone_id,
COUNT(*) AS num_trips
FROM trips
GROUP BY pickup_zone_id
ORDER BY num_trips DESC
LIMIT 10;
Advanced: Joins, Subqueries, and Window Functions
Busiest zones with names (JOIN):
SELECT z.zone_name,
COUNT(*) AS num_trips,
ROUND(AVG(t.fare), 2) AS avg_fare
FROM trips t
JOIN zones z ON t.pickup_zone_id = z.zone_id
GROUP BY z.zone_name
ORDER BY num_trips DESC
LIMIT 10;
Which zones have the highest average tips? (JOIN + GROUP BY + ORDER)
SELECT z.zone_name,
COUNT(*) AS num_trips,
ROUND(AVG(t.tip), 2) AS avg_tip,
ROUND(AVG(t.fare), 2) AS avg_fare
FROM trips t
JOIN zones z ON t.pickup_zone_id = z.zone_id
GROUP BY z.zone_name
HAVING num_trips > 10000
ORDER BY avg_tip DESC
LIMIT 10;
Airport trips vs non-airport trips (subquery):
SELECT
CASE
WHEN pickup_zone_id IN (132, 138, 1) THEN 'Airport'
ELSE 'Non-Airport'
END AS trip_type,
COUNT(*) AS num_trips,
ROUND(AVG(fare), 2) AS avg_fare,
ROUND(AVG(distance_miles), 2) AS avg_distance,
ROUND(AVG(tip), 2) AS avg_tip
FROM trips
GROUP BY trip_type;
Credit card vs cash tipping behavior:
SELECT
CASE payment_type
WHEN 1 THEN 'Credit Card'
WHEN 2 THEN 'Cash'
WHEN 3 THEN 'No Charge'
WHEN 4 THEN 'Dispute'
ELSE 'Other'
END AS payment_method,
COUNT(*) AS num_trips,
ROUND(AVG(tip), 2) AS avg_tip,
ROUND(AVG(fare), 2) AS avg_fare,
ROUND(AVG(tip) / NULLIF(AVG(fare), 0) * 100, 1) AS tip_pct
FROM trips
GROUP BY payment_type
ORDER BY num_trips DESC;
Hourly trip patterns (time extraction):
SELECT CAST(SUBSTR(pickup_time, 12, 2) AS INTEGER) AS hour_of_day,
COUNT(*) AS num_trips,
ROUND(AVG(fare), 2) AS avg_fare
FROM trips
WHERE pickup_time IS NOT NULL
GROUP BY hour_of_day
ORDER BY hour_of_day;
Why HPC?
This database has ~20 million rows and takes up ~9 GB on disk. Here’s why you’d want HPC for this:
| Your Laptop | HPC Compute Node | |
|---|---|---|
| RAM | 8-16 GB | 128-223 GB |
| Disk | Limited SSD | 1 TB+ on $WORK |
| Full table scan on 20M rows | Slow, might swap to disk | Fast, fits in memory |
| Multiple queries at once | Bogs down | Plenty of headroom |
A query that scans all 20 million rows needs to load the database into memory. At ~9 GB, a laptop with 8 GB of RAM can’t hold it without swapping to disk. On HPC with 223 GB of RAM, 9 GB is a rounding error.
For instructors: The real teaching moment is when a student runs a query on their laptop and it takes 30 seconds, then runs the same query on HPC and it takes 2 seconds. That’s the “aha.”
Tips for Instructors
Before class:
- Run
setup_data.shahead of time so students don’t wait 15 minutes - Consider cloning to a shared location on
$WORKif all students are on the same allocation - Test your custom queries on the actual data
Adapting for your course:
- The example queries cover SELECT, WHERE, GROUP BY, JOIN, HAVING, CASE, subqueries, and time extraction
- Add your own queries to match your syllabus
- The
zonestable is intentionally small (70 rows) to make JOINs fast for teaching - Payment type codes (1-4) are good for CASE WHEN exercises
Assessment ideas:
- “Write a query that finds the average tip percentage by zone for trips over 5 miles”
- “Which hour of the day has the most expensive average fare? Why might that be?”
- “Compare weekend vs weekday trip volumes” (requires date extraction)
Troubleshooting
“sqlite3: command not found”
- SQLite is usually available system-wide on HPC. Try
/usr/bin/sqlite3. If not, runmodule load gcc/13.2.0thenmodule load sqlite.
“setup_data.sh is taking forever”
- The download is ~3 GB. On HPC it should take 5-10 minutes. Make sure you’re on the login node (not a compute node) for better network access.
“database is locked”
- Only one process can write to SQLite at a time. If the setup script failed partway, delete the database and re-run:
rm data/nyc_taxi.db && bash scripts/setup_data.sh
“disk quota exceeded”
- You might be in
$HOME(10 GB limit). Make sure you cloned into$WORK.
Quick Reference
| Task | Command |
|---|---|
| Open database | sqlite3 data/nyc_taxi.db |
| Count rows | SELECT COUNT(*) FROM trips; |
| Show tables | .tables |
| Show columns | .schema trips |
| Pretty output | .mode column then .headers on |
| Export to CSV | .mode csv then .output results.csv then run query |
| Exit SQLite | .quit |