Large-Scale Data Integration & Statistical Inference

I led a technical data analysis project evaluating the long-term economic return on investment (ROI) of transitioning U.S. states from fossil-fuel electricity to locally available renewable energy. The pipeline integrates multiple federal datasets and computes three complementary ROI metrics: per-MWh efficiency, total state economic impact, and per-capita equity.

The Challenge: Integrating Disparate Data Sources

Evaluating state-level renewable transitions requires combining:

  • EPA eGRID: Current electricity generation mix by state
  • EIA Data: Historical electricity prices and consumption patterns
  • NREL Renewable Potential: Solar, wind, geothermal, hydro resource quality
  • DOE USEER: Energy sector employment and economic data
  • Census Data: Population, land area, demographic information

Each dataset uses different state identifiers, units, temporal resolutions, and reporting standards. The first technical challenge was building a robust data integration pipeline that handled these inconsistencies without losing information.

System Architecture

1. Data Ingestion & Cleaning

  • Automated download and parsing of federal datasets
  • Standardized state identifiers (FIPS codes, names, abbreviations)
  • Unit conversions and temporal alignment
  • Missing data imputation using domain-appropriate methods
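The identifier standardization step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the lookup table is truncated to four states, and the names `STATE_LOOKUP` and `normalize_state` are hypothetical. It uses only the standard library, whereas the real pipeline would more likely do this with pandas.

```python
# Hypothetical, truncated lookup: abbreviation -> (FIPS code, full name).
# A real pipeline would cover every state and territory in the datasets.
STATE_LOOKUP = {
    "CA": ("06", "California"),
    "NM": ("35", "New Mexico"),
    "TX": ("48", "Texas"),
    "WY": ("56", "Wyoming"),
}


def _build_index(lookup):
    """Map every known spelling (abbreviation, FIPS, name) to the abbreviation."""
    index = {}
    for abbr, (fips, name) in lookup.items():
        index[abbr.lower()] = abbr
        index[fips] = abbr
        index[name.lower()] = abbr
    return index


_INDEX = _build_index(STATE_LOOKUP)


def normalize_state(value):
    """Return the two-letter abbreviation for a FIPS code, name, or abbreviation."""
    key = str(value).strip().lower()
    if key.isdigit():
        key = key.zfill(2)  # FIPS codes read as integers lose their leading zero
    if key not in _INDEX:
        raise KeyError(f"Unrecognized state identifier: {value!r}")
    return _INDEX[key]


print(normalize_state("Texas"))   # TX
print(normalize_state(6))         # CA
print(normalize_state(" wy "))    # WY
```

Raising on unknown identifiers, rather than silently dropping rows, is one way to avoid losing information during the merge.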

2. Feature Engineering

Built composite features from raw data:

  • Current Fossil Fuel Dependence: % of electricity from coal, natural gas, petroleum
  • Renewable Resource Quality: Weighted average of solar, wind, hydro potential
  • Grid Infrastructure Readiness: Transmission capacity, interconnection density
  • Economic Baseline: Current energy costs, employment, state GDP contribution
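As an illustration of the first composite feature, fossil fuel dependence can be computed as the fossil share of total generation. The sketch below assumes generation is available as a per-fuel mapping in MWh; the fuel keys and the function name `fossil_dependence_pct` are hypothetical, and the numbers are invented for the example.

```python
# Fuel categories counted as fossil, following the list above.
FOSSIL_FUELS = ("coal", "natural_gas", "petroleum")


def fossil_dependence_pct(generation_mwh):
    """Percent of total generation that comes from coal, natural gas, and petroleum."""
    total = sum(generation_mwh.values())
    if total <= 0:
        raise ValueError("Total generation must be positive")
    fossil = sum(generation_mwh.get(fuel, 0.0) for fuel in FOSSIL_FUELS)
    return 100.0 * fossil / total


# Illustrative (made-up) generation mix for one state, in MWh.
example_mix = {"coal": 300.0, "natural_gas": 200.0, "wind": 400.0, "solar": 100.0}
print(fossil_dependence_pct(example_mix))  # 50.0
```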

3. State Feasibility Factor (SFF)

To ground technical potential in operational reality, I introduced a State Feasibility Factor that incorporates:

  • Renewable resource quality (physics constraint)
  • Grid readiness (infrastructure constraint)
  • Current renewable adoption (momentum indicator)
  • Population density (distribution efficiency)

The SFF weights technical ROI by how achievable it actually is for each state.
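One plausible way to build such a factor is a weighted average of the four components, each scaled to the range 0 to 1, which is then multiplied into the technical ROI. The exact combination rule and weights are not specified here, so everything in this sketch (the weights, the multiplicative adjustment, and the function names) is an assumption for illustration.

```python
# Hypothetical weights; the actual weighting used in the project is not stated.
SFF_WEIGHTS = {
    "resource_quality": 0.4,
    "grid_readiness": 0.3,
    "renewable_adoption": 0.2,
    "population_density": 0.1,
}


def state_feasibility_factor(components, weights=SFF_WEIGHTS):
    """Weighted average of component scores, each expected in [0, 1]."""
    for name in weights:
        score = components[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be scaled to [0, 1], got {score}")
    total_weight = sum(weights.values())
    return sum(weights[n] * components[n] for n in weights) / total_weight


def feasibility_adjusted_roi(technical_roi, sff):
    """Scale technical ROI by the feasibility factor (assumed multiplicative)."""
    return technical_roi * sff


# Made-up component scores for a single state.
scores = {
    "resource_quality": 0.9,
    "grid_readiness": 0.5,
    "renewable_adoption": 0.6,
    "population_density": 0.2,
}
sff = state_feasibility_factor(scores)
adjusted = feasibility_adjusted_roi(100.0, sff)
```

Because the result stays between 0 and 1, the adjustment can only discount technical ROI, never inflate it.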

4. ROI Metric Computation

Three complementary perspectives:

A. Per-MWh Efficiency

  • Cost savings per unit energy transitioned
  • Rewards resource quality and low implementation costs
  • Use Case: Identifies states with best marginal returns

B. Total State Economic Impact

  • Absolute dollar value of statewide transition
  • Accounts for state size and consumption
  • Use Case: Prioritizes large-scale economic benefits

C. Per-Capita Equity

  • Economic benefit per resident
  • Normalizes by population
  • Use Case: Ensures small states aren’t overlooked in national policy
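In simplified form, the three perspectives are different normalizations of the same savings figure. The sketch below assumes a single net annual savings value per state; the field names and the `roi_metrics` function are hypothetical, and the full analysis would also fold in costs, discounting, and the feasibility factor.

```python
from dataclasses import dataclass


@dataclass
class StateInputs:
    annual_savings_usd: float  # assumed net savings from the transition
    transitioned_mwh: float    # energy moved from fossil to renewable sources
    population: int


def roi_metrics(state):
    """Return the three ROI views: per-MWh, total impact, and per-capita."""
    if state.transitioned_mwh <= 0 or state.population <= 0:
        raise ValueError("Energy and population must be positive")
    return {
        "per_mwh": state.annual_savings_usd / state.transitioned_mwh,
        "total_impact": state.annual_savings_usd,
        "per_capita": state.annual_savings_usd / state.population,
    }


# Made-up inputs for illustration only.
example = StateInputs(annual_savings_usd=1_000_000.0,
                      transitioned_mwh=50_000.0,
                      population=200_000)
print(roi_metrics(example))
# {'per_mwh': 20.0, 'total_impact': 1000000.0, 'per_capita': 5.0}
```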

Statistical Validation & Robustness

To make the results defensible, I applied four validation steps:

1. Correlation Significance Testing

  • Identified which factors most strongly predict ROI
  • Controlled for multiple comparisons (Bonferroni correction)
  • Reported confidence intervals, not just point estimates
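The two ideas above can be sketched with the standard library alone. The Bonferroni step multiplies each p-value by the number of tests, and the confidence interval uses the Fisher z-transformation, which is one common choice for correlation coefficients and an assumption here. The p-values, sample size, and function names are hypothetical; in practice, Statsmodels or SciPy would supply the underlying tests.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def bonferroni(p_values, alpha=0.05):
    """Return {name: (adjusted_p, is_significant)} using Bonferroni correction."""
    m = len(p_values)
    results = {}
    for name, p in p_values.items():
        adjusted = min(1.0, p * m)  # adjusted p-values are capped at 1
        results[name] = (adjusted, adjusted < alpha)
    return results


def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r from n paired samples."""
    if n <= 3:
        raise ValueError("Need more than 3 samples")
    z = atanh(r)
    se = 1.0 / sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return tanh(z - crit * se), tanh(z + crit * se)


# Made-up raw p-values for three candidate predictors.
raw = {"wind_potential": 0.001, "solar_potential": 0.02, "grid_readiness": 0.4}
corrected = bonferroni(raw)
# wind_potential stays significant; solar_potential (0.02 * 3 = 0.06) does not.

low, high = fisher_ci(r=0.5, n=50)  # illustrative r and n, not project results
```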

2. Outlier Detection & Analysis

  • Flagged statistical outliers using robust z-scores
  • Investigated physical causes (e.g., Hawaii’s unique energy economics)
  • Reported results with and without outliers for transparency
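A robust z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD), so a single extreme state does not mask itself. The sketch below uses the conventional 0.6745 scaling constant and a cutoff of 3.5; the cutoff, the function names, and the sample values are assumptions, not the project's settings.

```python
from statistics import median


def robust_z_scores(values):
    """Modified z-scores based on the median and median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    # 0.6745 makes the MAD comparable to a standard deviation for normal data.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values_by_state, threshold=3.5):
    """Return the names whose absolute robust z-score exceeds the threshold."""
    names = list(values_by_state)
    scores = robust_z_scores([values_by_state[n] for n in names])
    return [n for n, z in zip(names, scores) if abs(z) > threshold]


# Made-up values: one state far from the rest.
sample = {"state_a": 10.0, "state_b": 11.0, "state_c": 12.0,
          "state_d": 13.0, "state_e": 40.0}
print(flag_outliers(sample))  # ['state_e']
```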

3. Sensitivity Analysis

Tested robustness to modeling assumptions:

  • Cost Assumptions: Varied solar/wind installation costs by ±30%
  • Resource Quality Weighting: Tested different aggregation methods
  • Discount Rates: Evaluated long-term ROI under different economic scenarios

Result: Core rankings remained stable across reasonable assumption ranges, indicating that the conclusions are robust.

4. Cross-Validation Against External Benchmarks

Compared results to:

  • Actual state renewable adoption rates (correlation = 0.71)
  • Independent economic analyses from NREL and EIA
  • State energy policy rankings from external organizations
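The benchmark comparison reduces to correlating model scores with an observed quantity, state by state. The sketch below computes a Pearson coefficient by hand; whether the reported figure is Pearson or a rank correlation is not stated, so that choice, along with the function name and sample values, is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("Correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Made-up model scores and observed adoption rates, aligned by state.
model_scores = [0.2, 0.4, 0.6, 0.8]
adoption_rates = [0.10, 0.25, 0.20, 0.40]
r = pearson_r(model_scores, adoption_rates)
```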

Key Results

Top ROI States (Per-MWh Efficiency)

  1. Wyoming: Exceptional wind resources, low population density
  2. New Mexico: High solar potential, low installation costs
  3. Texas: Massive scale, diverse renewables, existing infrastructure

Highest Total Economic Impact

  1. California: Largest energy consumption, strong solar/wind
  2. Texas: Scale + resource quality
  3. Florida: High electricity demand, excellent solar potential

Best Per-Capita Returns

  1. North Dakota: Wind-rich, low population
  2. Wyoming: Similar profile to ND
  3. Montana: Hydro + wind potential

Statistical Insights

  • Correlation Analysis: Wind potential is the strongest predictor of per-MWh ROI (r = 0.68)
  • Regional Patterns: Southwest states dominate solar ROI, Great Plains lead in wind
  • Grid Readiness: States with existing renewable infrastructure show 2.3x higher near-term ROI

Why This Matters for ML/Space Roles

This project demonstrates skills directly transferable to space industry work:

  • Multi-Source Data Integration: Space missions combine telemetry, ground observations, and simulation output, which pose the same integration challenges
  • Statistical Rigor: Mission trade studies require defensible analysis under uncertainty
  • Decision Support Modeling: Translating technical potential into operational recommendations
  • Reproducibility: All analysis is version-controlled, documented, and replicable
  • Systems Thinking: Understanding constraints beyond the technical (cost, infrastructure, policy)

These are the same patterns used in:

  • Mission feasibility analysis
  • Spacecraft design trade studies
  • Orbital mechanics optimization
  • Ground system capacity planning

Technical Stack

  • Python: pandas, NumPy, SciPy (core analysis)
  • Matplotlib / Seaborn: Visualization
  • Statsmodels: Statistical testing, regression analysis
  • Geopandas: Spatial analysis and mapping
  • Jupyter: Reproducible analysis notebooks
  • Git: Version control for data + code

Reproducibility & Open Science

The entire analysis is:

  • Version-controlled: Git repository with full commit history
  • Documented: Markdown documentation for every decision
  • Reproducible: Jupyter notebooks with step-by-step execution
  • Transparent: Assumptions, limitations, and uncertainties explicitly stated

While originally developed in an academic setting, the project is structured and presented as a standalone technical analysis, not coursework.

Current Status

Complete: Fully Reproducible Analysis

All code, data, and documentation are available. The methodology is extensible to:

  • Updated federal datasets (annual releases)
  • International comparisons
  • More granular regional analysis
  • Integration with climate models

Code Repository

View on GitHub (link to be added)


Key Insight: Good data science isn’t just about getting an answer; it’s about building confidence that the answer is right. Statistical validation, sensitivity analysis, and reproducibility are how you earn that confidence.