Python · Deep Learning · Machine Learning · CUNY DATA 602

Forecasting & Classifying Air Quality in NYC

A data-driven analysis of 45 years of EPA air quality data across all five NYC boroughs (1980–2025). The project uses an ELT pipeline through MySQL, an LSTM deep learning model for time series forecasting, and five classification algorithms to categorize AQI levels, connecting environmental data science to public health insight.

Python LSTM LightGBM Random Forest MySQL EPA Data Public Health CUNY DATA 602

Language

Python · SQL

Dataset

EPA AQI Daily — 77,083 records · 1980–2025

Models

LSTM · LightGBM · Random Forest · KNN · SVC · Logistic Regression

Course

CUNY DATA 602

Status

Published · Live

Links

GitHub Repo ↗ EPA Data Source ↗

01 About

As a data analyst at the New York State Department of Health, environmental air quality is directly relevant to the public health work I do daily. This project grew from that context — I wanted to apply data science rigorously to a domain with real public health stakes, using 45 years of EPA-published AQI data across the Bronx, Kings, New York, Queens, and Richmond counties.

The full data science lifecycle is covered: ELT pipeline through MySQL to consolidate 45 years of individual CSV files into a queryable database, Python-based loading and filtering via mysql.connector, followed by cleaning, feature engineering, EDA, and modeling across two distinct problem types — forecasting and classification.
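The loading step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the table name `aqi_daily`, the column names, and the connection settings are hypothetical, since the real schema lives in the repository. Only `mysql.connector` and pandas are named in the write-up.

```python
# County names come from the write-up; the table and column names below are
# assumptions made for illustration, not the project's actual schema.
NYC_COUNTIES = ["Bronx", "Kings", "New York", "Queens", "Richmond"]


def build_query(counties, table="aqi_daily"):
    """Return a parameterized SELECT and its parameter list for the given counties."""
    placeholders = ", ".join(["%s"] * len(counties))
    sql = (
        f"SELECT date, county_name, aqi, category, defining_parameter "
        f"FROM {table} "
        f"WHERE state_name = %s AND county_name IN ({placeholders}) "
        f"ORDER BY date"
    )
    return sql, ["New York", *counties]


def load_nyc_aqi(**conn_kwargs):
    """Run the query against MySQL and return the rows as a pandas DataFrame."""
    # Imported lazily so build_query can be used without a database driver installed.
    import mysql.connector
    import pandas as pd

    sql, params = build_query(NYC_COUNTIES)
    conn = mysql.connector.connect(**conn_kwargs)
    try:
        cursor = conn.cursor()
        cursor.execute(sql, params)
        rows = cursor.fetchall()
        columns = [desc[0] for desc in cursor.description]
        cursor.close()
    finally:
        conn.close()
    return pd.DataFrame(rows, columns=columns)
```

Filtering in SQL rather than in pandas means only the five NYC counties are pulled into memory. A call would look like `load_nyc_aqi(host=..., user=..., password=..., database=...)` with whatever credentials the local MySQL instance uses.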

Three research questions structure the work. The first characterizes long-term AQI trends and seasonal patterns, including the 2023 spike above 200 linked to Canadian wildfires and the overall downward trend since the 1980s. The second asks whether LSTM deep learning can forecast future AQI values; it can, with moderate accuracy for an inherently noisy environmental signal (MAE of 8.16 AQI units, R² of 0.56). The third compares five classification algorithms for predicting AQI category without using the AQI value itself, testing whether models can learn the structural patterns that produce each category.

The dataset contains 77,083 instances with no null values, spanning five counties and covering both routine pollution days and extreme events. The mean AQI of 51.8 sits just inside the "Moderate" band (51 to 100), so NYC air quality has averaged near the boundary between "Good" and "Moderate" across the full period.
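For reference, the category labels used throughout follow the standard EPA AQI bands. A minimal sketch of that mapping is below. Treating a fractional value such as a mean of 51.8 as belonging to the band above 50 is an assumption of this sketch, since the official bands are defined on whole-number AQI values.

```python
# Inclusive upper bounds of the standard EPA AQI categories.
AQI_BREAKPOINTS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_category(aqi):
    """Map an AQI value to its EPA category name."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, name in AQI_BREAKPOINTS:
        if aqi <= upper:
            return name
    # Anything above 300 falls in the top band.
    return "Hazardous"


print(aqi_category(51.8))  # Moderate
print(aqi_category(210))   # Very Unhealthy
```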

02 Research Questions

Three questions structure the analysis — one exploratory, one forecasting, one classification. Each is addressed with appropriate methodology and evaluated against standard performance metrics for its problem type.

RQ 1

What were the AQI trends and seasonal patterns across NYC counties from 1980 to 2025?

Trend: Overall AQI shows a downward trend since the 1980s peak, with recent years consistently below 50, which is classified as "Good."
Seasonal: AQI peaks in July, driven by ozone formation and summer wildfires, and is lowest in October. Winter months show stable moderate levels.
Event: The 2023 spike above 200 AQI is linked to Canadian wildfire smoke, confirmed by NYS DOH asthma ER visit reports from the same period.
County: New York County has the highest average AQI (64.2); Kings County has the lowest (44.2), followed closely by Queens (45.8).

RQ 2

Can deep learning time series models forecast future AQI levels accurately?

Model: Single-layer LSTM (50 units) trained with the Adam optimizer and MSE loss for 20 epochs on a chronological 80/20 train-test split.
MAE: Mean Absolute Error of 8.16 AQI units, which is acceptable for environmental forecasting where daily values fluctuate widely.
R²: R² of 0.56. The model captures major temporal trends and seasonal patterns but struggles with sharp pollution spikes.
Note: Training and validation losses converged without overfitting. Mild underfitting suggests room to improve with meteorological features or deeper architectures.
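The data preparation and model described for RQ 2 can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the series is synthetic, the 30-day look-back window and the `Dense(1)` output layer are assumed (the write-up states neither), and the min-max scaling is done by hand with training-set statistics instead of scikit-learn's `MinMaxScaler`, to keep the example dependency-light.

```python
import numpy as np


def make_windows(series, window):
    """Slice a 1-D series into (samples, window, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y


def build_lstm(window):
    """Single-layer LSTM as described above: 50 units, Adam optimizer, MSE loss."""
    # Imported lazily so the data preparation runs without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(window, 1)),
        keras.layers.LSTM(50),
        keras.layers.Dense(1),  # assumed single-value output head
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


# Synthetic stand-in for a daily AQI series (not project data).
rng = np.random.default_rng(0)
aqi = 50 + 15 * np.sin(np.arange(400) * 2 * np.pi / 365) + rng.normal(0, 5, 400)

# Chronological 80/20 split: the test set is strictly later than the training set.
split = int(len(aqi) * 0.8)
train, test = aqi[:split], aqi[split:]

# Min-max scaling with training statistics only, so the test range does not leak.
lo, hi = train.min(), train.max()
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

WINDOW = 30  # assumed look-back length
X_train, y_train = make_windows(train_scaled, WINDOW)
X_test, y_test = make_windows(test_scaled, WINDOW)
print(X_train.shape, y_train.shape)  # (290, 30, 1) (290,)
```

Arrays shaped (samples, timesteps, features) are the input layout Keras recurrent layers expect, so training would then be along the lines of `build_lstm(WINDOW).fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test))`.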

RQ 3

Can machine learning predict AQI categories using features other than the AQI value itself?

Best: LightGBM achieved the top F1-score of 0.59 and accuracy of 0.58, the best precision-recall balance across all six AQI categories.
2nd: KNN and Random Forest tied at an F1-score of 0.57, outperforming Logistic Regression and SVC by better capturing non-linear feature interactions.
Limit: Logistic Regression (F1: 0.33) and SVC (F1: 0.42) struggled with class imbalance and complex non-linear relationships.
CV: All models were tuned with GridSearchCV over TimeSeriesSplit folds, so each validation fold is later in time than its training data and temporal leakage is avoided.
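The tuning setup for RQ 3 can be illustrated with a small sketch. The data here is synthetic, the parameter grid is a placeholder, and `f1_weighted` is an assumed scoring choice, since the write-up does not say which F1 averaging was used. KNN stands in for any of the five classifiers.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, time-ordered stand-in data (not project data).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

tscv = TimeSeriesSplit(n_splits=5)

# Every fold trains on the past and validates on the block that follows it.
for train_idx, val_idx in tscv.split(X):
    assert train_idx.max() < val_idx.min()

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11]},  # illustrative grid only
    scoring="f1_weighted",                   # assumed averaging
    cv=tscv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Passing the splitter object as `cv` is what makes the grid search forward-chaining; with the default shuffled or unordered k-fold, a model could be validated on days that precede its training data.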

03 Key Results

LSTM Forecasting Performance

8.16 MAE (AQI units)

11.65 RMSE

28.4% MAPE

0.56 R² Score
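For readers less familiar with these four metrics, here is a minimal pure-Python sketch of how each is computed from actual and predicted values. The numbers in the example are made up for illustration and are not project results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, MAPE (in percent), and R² for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the true value, so it is undefined when any true value is 0.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}


# Made-up values for illustration only.
actual = [40, 50, 60, 80]
predicted = [44, 45, 66, 70]
metrics = regression_metrics(actual, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE. An RMSE noticeably above the MAE, as reported here (11.65 vs. 8.16), is consistent with a model that misses sharp spikes.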

Classification Model Comparison

Classifier	F1 Score	Accuracy	Notes
LightGBM	0.59	0.58	Best overall — precision/recall balance
KNN	0.57	0.55	Strong non-linear pattern capture
Random Forest	0.57	0.53	Comparable to KNN; ensemble robustness
SVC	0.42	0.37	Struggled with class imbalance
Logistic Regression	0.33	0.35	Underperformed on nonlinear relationships

04 What I practiced

01 ELT pipeline with MySQL — Built a SQL schema and loaded 45 years of EPA CSV files into MySQL Workbench. Queried and filtered NYC county data into a pandas DataFrame using mysql.connector, practicing real database-driven data acquisition at scale.
02 Time series preprocessing for LSTM — Applied MinMaxScaler normalization, constructed sliding-window sequences, and split temporal data preserving chronological order to avoid data leakage in a deep learning context.
03 LSTM architecture and training — Designed, compiled, and trained a Keras LSTM network with Adam optimizer and MSE loss. Monitored training vs. validation loss convergence and diagnosed underfitting from the learning curves.
04 Feature engineering for classification — Extracted month and decade bins from date fields, applied one-hot encoding, and deliberately excluded AQI from the feature set so models learned category boundaries from structural signals only.
05 TimeSeriesSplit cross-validation — Used forward-chaining CV with GridSearchCV to tune hyperparameters across all five classifiers without breaking temporal order — a critical distinction from standard k-fold for time-dependent data.
06 Multi-model comparative analysis — Evaluated LightGBM, Random Forest, KNN, SVC, and Logistic Regression on the same dataset and time-split, producing a rigorous apples-to-apples comparison of tree-based, distance-based, kernel-based, and linear classifiers.
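The feature engineering in item 04 can be sketched with pandas as below. The frame is hand-made and the column names are hypothetical. Whether county was part of the actual feature set is also an assumption; the write-up only confirms month and decade bins, one-hot encoding, and the exclusion of AQI.

```python
import pandas as pd

# Tiny hand-made frame with hypothetical column names (not project data).
df = pd.DataFrame({
    "date": pd.to_datetime(["1985-07-04", "1999-10-12", "2023-06-07"]),
    "county": ["Bronx", "Queens", "Kings"],
    "aqi": [112, 48, 254],
    "category": ["Unhealthy for Sensitive Groups", "Good", "Very Unhealthy"],
})

# Calendar features derived from the date field.
df["month"] = df["date"].dt.month
df["decade"] = (df["date"].dt.year // 10) * 10

# The target is the category; the AQI value itself is left out of the features.
y = df["category"]
X = pd.get_dummies(
    df[["county", "month", "decade"]],
    columns=["county", "month", "decade"],
)
print(sorted(X.columns))
```

Leaving `aqi` out matters because the category is a direct function of the AQI value; including it would reduce the classification task to a threshold lookup.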

Language	Python 3.13.5 · SQL
Environment	Jupyter Notebook 7.3.2 · MySQL Workbench 8.0
Data	pandas · mysql.connector · numpy
EDA / Viz	matplotlib · seaborn
Deep Learning	Keras (LSTM) · TensorFlow · MinMaxScaler
ML	scikit-learn · LightGBM · GridSearchCV · TimeSeriesSplit
Dataset	EPA AQI Daily by County · 77,083 records · 5 NYC counties · 1980–2025

Explore the full project

Full Jupyter Notebook and SQL schema available on GitHub.

