Portfolio Context

Artifact type: Technical case study
Audience: Social science researchers working with public datasets
Role: Documentation author

Building a Scalable Data Ingestion Pipeline for Public Datasets

Project Metadata

  • Primary Tools: Python (Pandas, Requests), MongoDB, Census API, NHTSA API
  • Role: Data Engineering & Research Design
  • Keywords: ETL Pipeline, Public Data, API Integration, Urban Informatics

Overview

This case study describes a data ingestion pipeline I developed during a research project combining U.S. Census American Community Survey (ACS) data with NHTSA Automated Vehicle (AV) test pilot data.

Public datasets in this domain are typically retrieved through browser portals and processed in spreadsheets, but those methods proved unreliable at the scale required for this study.

To support bulk extraction and transformation of the full datasets, the workflow was redesigned around direct API access, Python preprocessing scripts, and database storage. This approach allowed the datasets to be retrieved programmatically, cleaned before ingestion, and stored in a structured format suitable for analysis. The transition reduced acquisition time from hours of manual exports to a roughly 10-minute automated routine and made every record traceable back to the API call that produced it.

Problem Context

The research project required combining two public datasets:

  1. Automated Vehicle (AV) test pilot records from the National Highway Traffic Safety Administration (NHTSA)
  2. City-level demographic data from the U.S. Census American Community Survey (ACS)

The objective was to build a database allowing me to correlate testing activity with local census data to identify geographic trends in AV deployment.

Both sources provided browser-based portals that allowed CSV exports. These interfaces worked for small queries but created problems when assembling complete datasets:

  • Repeated manual filtering and downloads were required
  • Exports often froze during large queries
  • It was difficult to verify dataset completeness
  • Manual CSV aggregation increased the risk of errors

ACS exports contained hundreds of columns and frequently caused Excel to crash during filtering and transformation.

This led to redesigning the workflow around APIs, scripted processing, and structured storage.

System Constraints

Several constraints shaped the pipeline design.

Dataset scale

ACS tables contain 30,000+ records across hundreds of columns, making spreadsheet filtering and transformation unstable.

Data completeness

The research required extracting the full dataset from each source. Browser portals expose query interfaces rather than the complete underlying dataset, making completeness difficult to verify.

Reproducibility

The dataset needed to be regenerated periodically as new AV records and census releases became available. Manual workflows would make this difficult.

Data transformation complexity

ACS tables include multiple metadata fields such as margins of error and annotations. Most analysis only required the estimate values.

Time and tooling constraints

The initial design had to balance correctness with implementation time. While sophisticated data engineering solutions were possible, the pipeline needed to be built quickly using accessible tools and scripts that could be maintained throughout the research process.

[Figure: The problem: the manual data wrangling workflow]

System Architecture Overview

The final pipeline retrieves datasets from the U.S. Census ACS API and the NHTSA AV Test API, processes the raw data using Python scripts, and stores normalized records in MongoDB for analysis.

The system separates three responsibilities:

  • Data acquisition from external APIs
  • Data transformation through Python scripts that clean and normalize records
  • Data storage in a database used for analysis and future updates

The following diagram shows the pipeline structure.

[Figure: Data pipeline overview]

The workflow evolved from browser-based table processing to an API-driven pipeline.

Component Responsibilities

Census API

Provides demographic data by geographic unit.

NHTSA AV Test API

Provides records of automated vehicle testing programs, including location, participant organizations, and operational details.

Python Transformation Layer

Scripts handle:

  • Iterating through AV record IDs
  • Filtering unnecessary metadata columns
  • Mapping coded column names to human-readable labels
  • Generating cleaned CSV datasets

API Integration Design

The pipeline integrated two data sources:

  • U.S. Census ACS API
  • NHTSA Automated Vehicle Test Pilot API

Each required a different extraction strategy.

Census API Integration

The Census API provides access to American Community Survey datasets.

The pipeline used the ACS 5-year dataset, which aggregates survey data across five years to produce stable estimates for smaller geographic areas such as cities.

The API is accessed through parameterized URLs that specify:

  • Dataset year
  • Table group
  • Variables
  • Geographic scope

Example query:

https://api.census.gov/data/2022/acs/acs5/subject?get=NAME,GEO_ID,S0101_C01_001E&for=place:*

This returns the city name, geographic identifier, and population estimate.

Full tables were retrieved using group queries:

https://api.census.gov/data/2022/acs/acs5/subject?get=group(S0101)&for=place:*

This request returns the complete Age by Sex table for all U.S. cities.
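The Census API returns this result as a JSON list of lists, where the first row holds the coded column names and the remaining rows hold values. A minimal sketch of parsing that shape into a DataFrame; the `fetch_acs_table` helper name and the sample row values are illustrative, not taken from the project:

```python
import pandas as pd
import requests

# Group query from the example above: the full S0101 table for all places.
ACS_URL = (
    "https://api.census.gov/data/2022/acs/acs5/subject"
    "?get=group(S0101)&for=place:*"
)


def fetch_acs_table(url: str) -> pd.DataFrame:
    """Fetch an ACS table as a DataFrame.

    The Census API returns a JSON list of lists: the first row is the
    header of coded column names, the remaining rows are the values.
    """
    rows = requests.get(url, timeout=60).json()
    return pd.DataFrame(rows[1:], columns=rows[0])


# The same list-of-lists shape, parsed from a tiny illustrative sample:
sample = [
    ["NAME", "GEO_ID", "S0101_C01_001E"],
    ["Abbeville city, Alabama", "1600000US0100124", "2296"],
]
df = pd.DataFrame(sample[1:], columns=sample[0])
```

Note that the API returns every value as a string, so numeric columns need explicit conversion before analysis.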

Because Census variables are represented by coded identifiers, a separate metadata endpoint was used to retrieve variable descriptions.

Metadata query: https://api.census.gov/data/2022/acs/acs5/subject/variables

This metadata file maps coded variable identifiers to descriptive labels used during transformation.

AV Test Pilot API Integration

The NHTSA API uses a different structure.

Instead of returning a single table containing all records, the API first returns a list of tracker IDs, which must then be queried individually to retrieve each full record.

Initial endpoint:

https://avtest.nhtsa.dot.gov/api/trackers

In this dataset, the endpoint returned 234 tracker IDs.

Each record was then retrieved through:

https://avtest.nhtsa.dot.gov/api/trackers/{tracker_id}

Because this required one request per tracker ID, a Python script was used to automate retrieval.

import requests

TRACKER_ENDPOINT = "https://avtest.nhtsa.dot.gov/api/trackers"

# First request: the list of tracker IDs
tracker_ids = requests.get(TRACKER_ENDPOINT).json()

# One request per ID to retrieve each full record
records = []
for tracker_id in tracker_ids:
    response = requests.get(f"{TRACKER_ENDPOINT}/{tracker_id}")
    records.append(response.json())

The script also used graceful backoff and logged unusually slow requests during execution.
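The original backoff and timing logic is not shown in the source, but the pattern it describes can be sketched as follows. The function names, retry count, and slow-request threshold here are illustrative assumptions, not the project's actual values:

```python
import logging
import time

import requests

MAX_RETRIES = 3        # illustrative retry budget
SLOW_THRESHOLD = 5.0   # seconds; illustrative "unusually slow" cutoff

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("av_fetch")


def backoff_delay(attempt: int) -> float:
    """Exponential backoff schedule: 1s, 2s, 4s, ... per retry attempt."""
    return float(2 ** attempt)


def fetch_with_backoff(url: str) -> dict:
    """GET a URL, retrying on failure and logging unusually slow requests."""
    for attempt in range(MAX_RETRIES):
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            if attempt == MAX_RETRIES - 1:
                raise
            # Wait 1s, then 2s, then 4s before the next attempt.
            time.sleep(backoff_delay(attempt))
            continue
        elapsed = time.monotonic() - start
        if elapsed > SLOW_THRESHOLD:
            log.warning("slow request (%.1fs): %s", elapsed, url)
        return response.json()
```

Retrying with exponential delays keeps a long batch run alive through transient server errors, while the timing log flags endpoints worth investigating afterward.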

This approach allowed the full dataset to be assembled without manual interaction.

Data Transformation Rules

Python scripts handled:

  • Metadata filtering
  • Variable translation
  • Column normalization

Column Filtering

ACS tables include estimate values, margins of error, and annotation fields.

Columns with margin-of-error (_M) and annotation (_A) suffixes were excluded during preprocessing, since most analysis required only the estimate values. Anchoring the pattern to the end of the name avoids accidentally matching those substrings elsewhere in a column name:

df = df.loc[:, ~df.columns.str.contains(r"_M$|_A$")]
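As a runnable sketch of this filter, using illustrative column names that follow the _M/_A suffix convention described above:

```python
import pandas as pd

# Illustrative columns: E = estimate, _M = margin of error, _A = annotation.
df = pd.DataFrame(
    [[100, 5, "ok", 40]],
    columns=[
        "S0101_C01_001E",
        "S0101_C01_001_M",
        "S0101_C01_001_A",
        "S0101_C01_002E",
    ],
)

# Keep only columns that do not end in _M or _A.
df = df.loc[:, ~df.columns.str.contains(r"_M$|_A$")]
```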

Variable Name Translation

Census datasets use coded identifiers such as:

S0101_C01_001E

The metadata file was used to map these identifiers to readable labels.

Example mapping:

Original Variable     Translated Label
S0101_C01_001E        Total Population
S0101_C01_002E        Male Population
S0101_C01_026E        Population Age 25–29

This mapping was implemented using a lookup table loaded into the Python script.
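A minimal sketch of that lookup, assuming the metadata endpoint's list-of-lists shape of [name, label, concept] rows; the label text here is illustrative, matching the translated labels in the table above rather than the raw metadata strings:

```python
import pandas as pd

# Rows in the shape returned by the variables metadata endpoint:
# [name, label, concept]. Label text is illustrative.
variable_rows = [
    ["name", "label", "concept"],
    ["S0101_C01_001E", "Total Population", "Age and Sex"],
    ["S0101_C01_002E", "Male Population", "Age and Sex"],
]

# Build the code -> label lookup table, skipping the header row.
label_map = {name: label for name, label, _concept in variable_rows[1:]}

# Apply the lookup to a DataFrame's coded column names.
df = pd.DataFrame(columns=["NAME", "S0101_C01_001E", "S0101_C01_002E"])
df = df.rename(columns=label_map)
```

Columns absent from the lookup (such as NAME) pass through `rename` unchanged, so partial mappings are safe.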

Header Normalization

Some CSV exports included formatting artifacts such as:

  • Bracket characters ([, ])
  • Quotation marks embedded in field names
  • Inconsistent metadata formatting

These artifacts prevented reliable database ingestion.

The transformation script cleaned these headers before writing the final dataset.
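The cleaning step can be sketched as a small helper; the function name is an assumption, and the artifact characters it strips are the ones listed above:

```python
def clean_header(name: str) -> str:
    """Strip bracket characters and embedded quotation marks from an
    exported column name, then trim surrounding whitespace."""
    for ch in '[]"\'':
        name = name.replace(ch, "")
    return name.strip()

# Applied across a DataFrame's columns:
# df.columns = [clean_header(c) for c in df.columns]
```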

Output Dataset Structure

The cleaned dataset contained:

  • Normalized column names
  • Filtered estimate values
  • Consistent row structure

The resulting CSV files were then loaded into the database.

Storage Architecture Decision

Three storage options were evaluated:

  • Spreadsheets — used for early exploration but became unstable
  • PostgreSQL — considered for its geospatial support, but ACS CSV formatting artifacts (hundreds of columns, brackets, quotation marks) required heavy preprocessing
  • MongoDB — chosen for its schema flexibility: ACS tables with hundreds of columns could be stored without a predefined schema. Schema-on-read let the Python transformation layer dictate document structure dynamically, avoiding rigid SQL schema migrations.

This "transform-then-load" approach simplified the entire pipeline.

Pipeline Outcome

The completed pipeline produced:

  • Demographic datasets for 30,000+ U.S. cities using ACS 5-year surveys
  • Structured records of 200+ AV testing programs
  • Normalized datasets suitable for database storage and analysis

The final output provided a unified dataset for spatial analysis of AV testing footprints relative to city-level socioeconomic indicators.

Lessons Learned

APIs Are More Reliable for Large Dataset Retrieval

Browser portals are adequate for spot queries but hit practical limits at scale. API access is worth the setup cost when dataset size or retrieval frequency makes manual exports impractical.

Custom Scripts Can Unlock Difficult Datasets

Default interfaces reflect the most common use case, not the full capability of the underlying data. When a dataset seems impractical to work with, the constraint may be the interface, not the data itself.

Separate Retrieval, Transformation, and Storage Stages

The common workflow for working with ACS datasets relies on browser portals and spreadsheet software, which treat acquisition and transformation as a single event. With large tables, the portal frequently froze or failed before an export completed, and a failed transformation in the spreadsheet meant re-downloading the data from scratch.

Separating the pipeline into distinct stages — retrieval, cleaning, and storage — resolved both problems. Raw tables were retrieved programmatically, cleaned using scripts, and stored in a normalized structure before analysis. A failed transformation script didn't require re-fetching the data; it stayed in MongoDB, ready to reprocess.