Python · Deep Learning · Machine Learning · CUNY DATA 602

Forecasting & Classifying Air Quality in NYC

A data-driven analysis of 45 years of EPA air quality data across all five NYC boroughs (1980–2025). The project uses an ELT pipeline through MySQL, LSTM deep learning for time series forecasting, and five classification algorithms to categorize AQI levels, connecting environmental data science to public health insight.

Python · LSTM · LightGBM · Random Forest · MySQL · EPA Data · Public Health · CUNY DATA 602
Language: Python · SQL
Dataset: EPA AQI Daily · 77,083 records · 1980–2025
Models: LSTM · LightGBM · Random Forest · KNN · SVC · Logistic Regression
Course: CUNY DATA 602
Status: Published · Live
01 About

As a data analyst at the New York State Department of Health, environmental air quality is directly relevant to the public health work I do daily. This project grew from that context — I wanted to apply data science rigorously to a domain with real public health stakes, using 45 years of EPA-published AQI data across the Bronx, Kings, New York, Queens, and Richmond counties.

The full data science lifecycle is covered: ELT pipeline through MySQL to consolidate 45 years of individual CSV files into a queryable database, Python-based loading and filtering via mysql.connector, followed by cleaning, feature engineering, EDA, and modeling across two distinct problem types — forecasting and classification.
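
The Python loading step might look roughly like the sketch below. The table name `daily_aqi`, its column names, and the connection parameters are hypothetical placeholders, since the real schema lives in the project's SQL files. Only `mysql.connector.connect`, `cursor()`, `execute()`, `fetchall()`, and `close()` are actual connector calls.

```python
# Sketch only: `daily_aqi` and its columns are hypothetical, not the project's schema.
# The county names are fixed constants, so they are written directly into the query.
QUERY = """
    SELECT obs_date, county_name, aqi
    FROM daily_aqi
    WHERE state_name = 'New York'
      AND county_name IN ('Bronx', 'Kings', 'New York', 'Queens', 'Richmond')
    ORDER BY obs_date, county_name
"""


def fetch_nyc_rows(conn):
    """Return (date, county, aqi) rows for the five NYC counties.

    `conn` is any DB-API 2.0 connection, for example one returned by
    mysql.connector.connect().
    """
    cur = conn.cursor()
    try:
        cur.execute(QUERY)
        return cur.fetchall()
    finally:
        cur.close()


def connect_mysql(host, user, password, database):
    """Open a MySQL connection; all four arguments are placeholders."""
    # Imported lazily so fetch_nyc_rows() can be used without the connector installed.
    import mysql.connector

    return mysql.connector.connect(
        host=host, user=user, password=password, database=database
    )
```

Filtering in SQL rather than in pandas keeps the transfer small: only the five NYC counties leave the database, and the rows can then be passed to `pandas.DataFrame` for cleaning.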

Three research questions structure the work. The first characterizes long-term AQI trends and seasonal patterns, including the 2023 spike above 200 linked to Canadian wildfires and the overall downward trend since the 1980s. The second asks whether LSTM deep learning can forecast future AQI values. It can, with moderate accuracy for an inherently noisy environmental signal (MAE of 8.16 AQI units, R² of 0.56). The third compares five classification algorithms for predicting AQI category without using the AQI value itself, testing whether models can learn the structural patterns that produce each category.

The dataset contains 77,083 instances with no null values, spanning five counties and covering both routine pollution days and extreme events. The mean AQI of 51.8 sits just above the boundary between "Good" (0–50) and "Moderate" (51–100), so NYC has averaged the low end of "Moderate" air quality across the full period.
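
For reference, the six AQI categories follow the standard EPA bands. The helper below is a minimal sketch of that mapping. The function name `aqi_category` is illustrative and not taken from the project notebook.

```python
# (inclusive upper bound, category) pairs for the standard EPA AQI bands.
AQI_BANDS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_category(aqi):
    """Map an AQI value to its EPA category name."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, name in AQI_BANDS:
        if aqi <= upper:
            return name
    # Anything above 300 falls in the top band.
    return "Hazardous"
```

With this mapping, the dataset mean of 51.8 lands in "Moderate", and a reading above 200 such as the 2023 wildfire spike lands in "Very Unhealthy".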

02 Research Questions

Three questions structure the analysis — one exploratory, one forecasting, one classification. Each is addressed with appropriate methodology and evaluated against standard performance metrics for its problem type.

RQ 1
What were the AQI trends and seasonal patterns across NYC counties from 1980 to 2025?
  • Trend: Overall AQI shows a downward trend since the 1980s peak, with recent years consistently below 50, classified as "Good."
  • Seasonal: AQI peaks in July, driven by ozone formation and summer wildfires, and is lowest in October. Winter months show stable moderate levels.
  • Event: 2023 spike over 200 AQI linked to Canadian wildfire smoke, confirmed by NYS DOH asthma ER visit reports from the same period.
  • County: New York County has the highest average AQI (64.2); Kings County the lowest (44.2), followed closely by Queens (45.8).
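
The seasonal and county comparisons above come down to grouped averages. A minimal pandas sketch of that aggregation follows; the column names `date`, `county`, and `aqi` are assumptions, and the toy rows are made-up values for illustration, not project data.

```python
import pandas as pd


def monthly_and_county_means(df):
    """Return mean AQI by calendar month and by county (highest first)."""
    df = df.assign(date=pd.to_datetime(df["date"]))
    by_month = df.groupby(df["date"].dt.month)["aqi"].mean().rename_axis("month")
    by_county = df.groupby("county")["aqi"].mean().sort_values(ascending=False)
    return by_month, by_county


# Toy rows only, to show the shape of the input and output.
toy = pd.DataFrame(
    {
        "date": ["2023-06-07", "2023-07-01", "2023-07-02", "2023-10-01"],
        "county": ["Bronx", "Kings", "Bronx", "Kings"],
        "aqi": [200, 60, 80, 30],
    }
)
by_month, by_county = monthly_and_county_means(toy)
```

Grouping on the calendar month pools all years together, which is what exposes a recurring seasonal peak rather than a single unusual year.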
RQ 2
Can deep learning time series models forecast future AQI levels accurately?
  • Model: Single-layer LSTM (50 units) trained with the Adam optimizer and MSE loss for 20 epochs on an 80/20 train-test split.
  • MAE: Mean Absolute Error of 8.16 AQI units, acceptable for environmental forecasting where daily values fluctuate widely.
  • R²: 0.56. The model captures major temporal trends and seasonal patterns but struggles with sharp pollution spikes.
  • Note: Training and validation losses converged without overfitting. Mild underfitting suggests room to improve with meteorological features or deeper architectures.
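
A setup like the one described for RQ 2 could be sketched as follows. The 50 LSTM units, Adam optimizer, MSE loss, and 80/20 split come from the description above; the window length, the helper names, and the single `Dense(1)` output layer are assumptions, since the notebook's exact preprocessing is not shown here.

```python
import numpy as np


def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype="float32")
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y


def chronological_split(X, y, train_frac=0.8):
    """Split without shuffling so the test set is strictly later in time."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]


def build_lstm(window):
    """Single-layer LSTM regressor; the output layer is an assumption."""
    # Imported lazily so the helpers above run without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential(
        [
            keras.Input(shape=(window, 1)),
            keras.layers.LSTM(50),
            keras.layers.Dense(1),
        ]
    )
    model.compile(optimizer="adam", loss="mse")
    return model
```

Training would then be something like `build_lstm(window).fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test))`. The project lists MinMaxScaler in its stack; fitting the scaler on the training portion only is a common choice and an assumption here, not something the write-up states.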
RQ 3
Can machine learning predict AQI categories using features other than the AQI value itself?
  • Best: LightGBM achieved the top F1-score of 0.59 and accuracy of 0.58, the best precision-recall balance across all six AQI categories.
  • 2nd: KNN and Random Forest tied at F1 0.57, outperforming linear models by better capturing non-linear feature interactions.
  • Limit: Logistic Regression (F1: 0.33) and SVC (F1: 0.42) struggled with class imbalance and complex non-linear relationships.
  • CV: All models were tuned via GridSearchCV with TimeSeriesSplit, which preserves temporal order in cross-validation so that no future data leaks into the training folds.
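
The tuning approach named in the CV bullet can be sketched with scikit-learn as below. The synthetic data, the parameter grid, the number of splits, and the `f1_weighted` scoring are all assumptions made for illustration; the write-up does not state which grid or F1 averaging the project used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data: rows are assumed to be in chronological order.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Each fold trains on an expanding prefix and validates on the block after it.
cv = TimeSeriesSplit(n_splits=5)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},  # hypothetical grid
    cv=cv,
    scoring="f1_weighted",  # assumed averaging
)
search.fit(X, y)
best_params = search.best_params_
```

Because every training index precedes every validation index in each fold, hyperparameters are never selected using information from the future, which an ordinary shuffled K-fold would allow.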
03 Key Results
LSTM Forecasting Performance
MAE: 8.16 AQI units
RMSE: 11.65
MAPE: 28.4%
R² Score: 0.56
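
For readers who want the definitions behind these four numbers, here is a small plain-Python sketch. The function name `regression_metrics` is illustrative, and the formulas are the textbook ones rather than a copy of the notebook's code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, MAPE (in percent), and R² for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when a true value is 0; this sketch assumes none are.
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}
```

RMSE squares the errors before averaging, so an RMSE well above the MAE (11.65 versus 8.16 here) is consistent with a few large misses, such as sharp pollution spikes, dominating the error.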
Classification Model Comparison
Classifier          | F1 Score | Accuracy | Notes
LightGBM            | 0.59     | 0.58     | Best overall: precision/recall balance
KNN                 | 0.57     | 0.55     | Strong non-linear pattern capture
Random Forest       | 0.57     | 0.53     | Comparable to KNN; ensemble robustness
SVC                 | 0.42     | 0.37     | Struggled with class imbalance
Logistic Regression | 0.33     | 0.35     | Underperformed on nonlinear relationships
04 What I practiced
Language: Python 3.13.5 · SQL
Environment: Jupyter Notebook 7.3.2 · MySQL Workbench 8.0
Data: pandas · mysql.connector · numpy
EDA / Viz: matplotlib · seaborn
Deep Learning: Keras (LSTM) · TensorFlow · MinMaxScaler
ML: scikit-learn · LightGBM · GridSearchCV · TimeSeriesSplit
Dataset: EPA AQI Daily by County · 77,083 records · 5 NYC counties · 1980–2025

Explore the full project

Full Jupyter Notebook and SQL schema available on GitHub.