Building a Scalable Data Ingestion Pipeline for Public Datasets
Project Metadata
- Primary Tools: Python (Pandas, Requests), MongoDB, Census API, NHTSA API
- Role: Data Engineering & Research Design
- Keywords: ETL Pipeline, Public Data, API Integration, Urban Informatics
Overview
This case study describes a data ingestion pipeline I developed during a research project combining U.S. Census American Community Survey (ACS) data with NHTSA Automated Vehicle (AV) test pilot data.
Public datasets in this domain are typically retrieved through browser portals and processed in spreadsheets, but those methods proved unreliable at the scale required for this study.
To support bulk extraction and transformation of the full datasets, the workflow was redesigned around direct API access, Python preprocessing scripts, and database storage. This approach allowed the datasets to be retrieved programmatically, cleaned before ingestion, and stored in a structured format suitable for analysis. The transition reduced acquisition time from hours of manual work to a roughly 10-minute automated routine and made every record traceable to the API query that produced it.
Problem Context
The research project required combining two public datasets:
- Automated Vehicle (AV) test pilot records from the National Highway Traffic Safety Administration (NHTSA)
- City-level demographic data from the U.S. Census American Community Survey (ACS)
The objective was to build a database allowing me to correlate testing activity with local census data to identify geographic trends in AV deployment.
Both sources provided browser-based portals that allowed CSV exports. These interfaces worked for small queries but created problems when assembling complete datasets:
- Repeated manual filtering and downloads were required
- Exports often froze during large queries
- It was difficult to verify dataset completeness
- Manual CSV aggregation increased the risk of errors
ACS exports contained hundreds of columns and frequently caused Excel to crash during filtering and transformation, and manually stitching together partial CSV exports risked producing incomplete or corrupted datasets.
This led to redesigning the workflow around APIs, scripted processing, and structured storage.
System Constraints
Several constraints shaped the pipeline design.
Dataset scale
ACS tables contain 30,000+ records across hundreds of columns, making spreadsheet filtering and transformation unstable.
Data completeness
The research required extracting the full dataset from each source. Browser portals expose query interfaces rather than the complete underlying dataset, making completeness difficult to verify.
Reproducibility
The dataset needed to be regenerated periodically as new AV records and census releases became available. Manual workflows would make this difficult.
Data transformation complexity
ACS tables include multiple metadata fields such as margins of error and annotations. Most analysis only required the estimate values.
Time and tooling constraints
The initial design had to balance correctness with implementation time. While sophisticated data engineering solutions were possible, the pipeline needed to be built quickly using accessible tools and scripts that could be maintained throughout the research process.
System Architecture Overview
The final pipeline retrieves datasets from the U.S. Census ACS API and the NHTSA AV Test API, processes the raw data using Python scripts, and stores normalized records in MongoDB for analysis.
The system separates three responsibilities:
- Data acquisition from external APIs
- Data transformation through Python scripts that clean and normalize records
- Data storage in a database used for analysis and future updates
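As an illustrative sketch (the function names are hypothetical, not taken from the project code), this separation of responsibilities can be expressed as three swappable stages composed by a thin driver:

```python
def run_pipeline(fetch, transform, store):
    """Run acquisition, transformation, and storage as independent stages.

    Each stage is a plain callable, so any one of them can be re-run or
    unit-tested without touching the others.
    """
    raw = fetch()             # e.g. pull JSON from the Census / NHTSA APIs
    cleaned = transform(raw)  # e.g. drop metadata columns, rename headers
    store(cleaned)            # e.g. insert documents into MongoDB
    return cleaned
```

Because the stages only meet at these two hand-offs, a failure in the transformation step never forces a re-fetch of the raw data.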
The following diagram shows the pipeline structure.
Data Pipeline Overview
The workflow evolved from browser-based table processing to an API-driven pipeline.
Component Responsibilities
Census API
Provides demographic data by geographic unit.
NHTSA AV Test API
Provides records of automated vehicle testing programs, including location, participant organizations, and operational details.
Python Transformation Layer
Scripts handle:
- Iterating through AV record IDs
- Filtering unnecessary metadata columns
- Mapping coded column names to human-readable labels
- Generating cleaned CSV datasets
API Integration Design
The pipeline integrated two data sources:
- U.S. Census ACS API
- NHTSA Automated Vehicle Test Pilot API
Each required a different extraction strategy.
Census API Integration
The Census API provides access to American Community Survey datasets.
The pipeline used the ACS 5-year dataset, which aggregates survey data across five years to produce stable estimates for smaller geographic areas such as cities.
The API is accessed through parameterized URLs that specify:
- Dataset year
- Table group
- Variables
- Geographic scope
Example query:
https://api.census.gov/data/2022/acs/acs5/subject?get=NAME,GEO_ID,S0101_C01_001E&for=place:*
This returns the city name, geographic identifier, and population estimate.
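A minimal sketch of issuing that query from Python with Requests (the endpoint and variables come from the example above; the helper name is illustrative):

```python
import requests

BASE = "https://api.census.gov/data/2022/acs/acs5/subject"

def fetch_places(variables):
    """Query the ACS subject endpoint for every U.S. place (city)."""
    params = {"get": ",".join(variables), "for": "place:*"}
    resp = requests.get(BASE, params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.json()        # a JSON array of arrays; row 0 is the header
    return rows[0], rows[1:]  # (header, data rows)

# header, data = fetch_places(["NAME", "GEO_ID", "S0101_C01_001E"])
```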
Full tables were retrieved using group queries:
https://api.census.gov/data/2022/acs/acs5/subject?get=group(S0101)&for=place:*
This request returns the complete Age by Sex table for all U.S. cities.
Because Census variables are represented by coded identifiers, a separate metadata endpoint was used to retrieve variable descriptions.
Metadata query:
https://api.census.gov/data/2022/acs/acs5/subject/variables
This metadata file maps coded variable identifiers to descriptive labels used during transformation.
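A sketch of turning that metadata into a lookup dictionary. The Census API also serves this endpoint as `variables.json`; the parsing below assumes its documented shape, a top-level `"variables"` object mapping each code to an entry with a `"label"` field:

```python
import requests

VARS_URL = "https://api.census.gov/data/2022/acs/acs5/subject/variables.json"

def label_map(meta):
    """Build {variable code -> descriptive label}, keeping estimate (E) columns."""
    return {code: info["label"]
            for code, info in meta["variables"].items()
            if code.endswith("E")}

# meta = requests.get(VARS_URL, timeout=60).json()  # fetch once, then reuse
# labels = label_map(meta)
```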
AV Test Pilot API Integration
The NHTSA API uses a different structure.
Instead of returning a single table containing all records, the API first returns a list of tracker IDs, which must then be queried individually to retrieve each full record.
Initial endpoint:
https://avtest.nhtsa.dot.gov/api/trackers
At the time of retrieval, this endpoint returned 234 tracker IDs.
Each record was then retrieved through:
https://avtest.nhtsa.dot.gov/api/trackers/{tracker_id}
Because this required one request per tracker ID, a Python script was used to automate retrieval.
import requests

TRACKER_ENDPOINT = "https://avtest.nhtsa.dot.gov/api/trackers"

tracker_ids = requests.get(TRACKER_ENDPOINT).json()
records = []
for tracker_id in tracker_ids:
    response = requests.get(f"{TRACKER_ENDPOINT}/{tracker_id}")
    records.append(response.json())
The script also backed off between retries when requests failed and logged unusually slow responses during execution.
This approach allowed the full dataset to be assembled without manual interaction.
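A sketch of what that retry-and-logging wrapper might look like. The threshold, retry count, and logger name are illustrative assumptions, not values from the project code:

```python
import logging
import time
import requests

log = logging.getLogger("av_pipeline")
SLOW_THRESHOLD = 5.0  # seconds; anything slower gets logged

def fetch_with_backoff(url, retries=3, base_delay=1.0):
    """GET a URL with exponential backoff on failure; log unusually slow responses."""
    for attempt in range(retries):
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        elapsed = time.monotonic() - start
        if elapsed > SLOW_THRESHOLD:
            log.warning("slow request (%.1fs): %s", elapsed, url)
        return resp.json()
```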
Data Transformation Rules
Python scripts handled:
- Metadata filtering
- Variable translation
- Column normalization
Column Filtering
ACS tables include estimate values, margins of error, and annotation fields.
Columns containing margin-of-error (_M) and annotation (_A) fields were excluded during preprocessing. For most analysis, only the estimate values were needed.
df = df.loc[:, ~df.columns.str.contains("_M|_A")]
Variable Name Translation
Census datasets use coded identifiers such as:
S0101_C01_001E
The metadata file was used to map these identifiers to readable labels.
Example mapping:
| Original Variable | Translated Label |
|---|---|
| S0101_C01_001E | Total Population |
| S0101_C01_002E | Male Population |
| S0101_C01_026E | Population Age 25–29 |
This mapping was implemented using a lookup table loaded into the Python script.
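A sketch of that rename step. The dictionary below is a hypothetical excerpt of the lookup table; in the real pipeline it was built from the metadata endpoint:

```python
import pandas as pd

# Hypothetical excerpt of the code-to-label lookup table
LABELS = {
    "S0101_C01_001E": "Total Population",
    "S0101_C01_002E": "Male Population",
    "S0101_C01_026E": "Population Age 25-29",
}

def translate_columns(df):
    """Rename coded ACS columns to readable labels; unknown codes pass through."""
    return df.rename(columns=LABELS)
```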
Header Normalization
Some CSV exports included formatting artifacts such as:
- Bracket characters ([, ])
- Quotation marks embedded in field names
- Inconsistent metadata formatting
These artifacts prevented reliable database ingestion.
The transformation script cleaned these headers before writing the final dataset.
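A sketch of such a header-cleaning helper. The exact rules in the project script may have differed; this version covers the artifacts listed above:

```python
import re

def clean_header(name):
    """Strip brackets and embedded quotes, then collapse stray whitespace."""
    name = re.sub(r'[\[\]"\']', "", name)
    return re.sub(r"\s+", " ", name).strip()
```

For example, `clean_header('["Total Population"]')` yields `Total Population`.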
Output Dataset Structure
The cleaned dataset contained:
- Normalized column names
- Filtered estimate values
- Consistent row structure
The resulting CSV files were then loaded into the database.
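A sketch of that load step. The database and collection names are illustrative, and the pymongo usage is shown commented out so the CSV-reading portion stands alone:

```python
import csv

def csv_to_docs(path):
    """Read a cleaned CSV into a list of dicts, one document per row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Assumed local MongoDB instance and illustrative names:
# from pymongo import MongoClient
# coll = MongoClient("mongodb://localhost:27017")["av_study"]["acs_places"]
# coll.insert_many(csv_to_docs("acs_cleaned.csv"))
```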
Storage Architecture Decision
Three storage options were evaluated:
- Spreadsheets — used for early exploration but became unstable
- PostgreSQL — considered for its geospatial support, but ACS CSV formatting artifacts (hundreds of columns, brackets, quotation marks) required heavy preprocessing
- MongoDB — chosen for its schema flexibility with ACS tables spanning hundreds of columns. Schema-on-read let the Python transformation layer dictate document structure dynamically, avoiding rigid SQL migration overhead
This "transform-then-load" approach simplified the entire pipeline.
Pipeline Outcome
The completed pipeline produced:
- Demographic datasets for 30,000+ U.S. cities using ACS 5-year surveys
- Structured records of 200+ AV testing programs
- Normalized datasets suitable for database storage and analysis
The final output provided a unified dataset for spatial analysis of AV testing footprints relative to city-level socioeconomic indicators.
Lessons Learned
APIs Are More Reliable for Large Dataset Retrieval
Browser portals are adequate for spot queries but hit practical limits at scale. API access is worth the setup cost when dataset size or retrieval frequency makes manual exports impractical.
Custom Scripts Can Unlock Difficult Datasets
Default interfaces reflect the most common use case, not the full capability of the underlying data. When a dataset seems impractical to work with, the constraint may be the interface, not the data itself.
Separate Retrieval, Transformation, and Storage Stages
The common workflow for ACS datasets relies on browser portals and spreadsheet software, which treat acquisition and transformation as a single step. With large tables, the portal frequently froze or failed before an export completed, and a failed transformation forced a fresh download of the underlying data.
Separating the pipeline into distinct stages (retrieval, cleaning, and storage) resolved both problems. Raw tables were retrieved programmatically, cleaned using scripts, and stored in a normalized structure before analysis. A failed transformation script never required re-fetching: the retrieved data stayed in MongoDB, ready to reprocess.