A data-driven analysis of 45 years of EPA air quality data across all five NYC boroughs (1980–2025). The project combines an ELT pipeline through MySQL, LSTM deep learning for time series forecasting, and five classification algorithms for categorizing AQI levels, connecting environmental data science to public health insight.
As a data analyst at the New York State Department of Health, I find environmental air quality directly relevant to the public health work I do daily. This project grew from that context: I wanted to apply data science rigorously to a domain with real public health stakes, using 45 years of EPA-published AQI data across Bronx, Kings, New York, Queens, and Richmond counties.
The project covers the full data science lifecycle: an ELT pipeline through MySQL that consolidates 45 years of individual CSV files into a queryable database, Python-based loading and filtering via mysql.connector, and then cleaning, feature engineering, EDA, and modeling across two distinct problem types, forecasting and classification.
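The load-then-filter pattern described above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: it swaps in Python's built-in `sqlite3` for MySQL so it runs without a server, and the table name `aqi_raw`, the three-column layout, and the sample rows are all made up for the example. With mysql.connector the shape would be similar, although queries typically go through a cursor and use `%s` placeholders instead of `?`.

```python
import csv
import io
import sqlite3

# Hypothetical yearly extracts with invented values; the real EPA files
# contain more columns and far more rows.
YEARLY_CSVS = {
    1980: "county,date,aqi\nBronx,1980-01-01,88\nQueens,1980-01-01,74\n",
    2023: "county,date,aqi\nBronx,2023-06-07,254\nSuffolk,2023-06-07,190\n",
}

NYC_COUNTIES = ("Bronx", "Kings", "New York", "Queens", "Richmond")


def load_raw(conn, yearly_csvs):
    """Extract + Load: copy every CSV row into one raw table, unfiltered."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS aqi_raw (county TEXT, date TEXT, aqi INTEGER)"
    )
    for _year, text in sorted(yearly_csvs.items()):
        rows = [
            (r["county"], r["date"], int(r["aqi"]))
            for r in csv.DictReader(io.StringIO(text))
        ]
        conn.executemany("INSERT INTO aqi_raw VALUES (?, ?, ?)", rows)
    conn.commit()


def fetch_nyc(conn):
    """Transform at query time: keep only the five NYC counties."""
    placeholders = ", ".join("?" for _ in NYC_COUNTIES)
    sql = (
        "SELECT county, date, aqi FROM aqi_raw "
        f"WHERE county IN ({placeholders}) ORDER BY date, county"
    )
    return conn.execute(sql, NYC_COUNTIES).fetchall()


conn = sqlite3.connect(":memory:")
load_raw(conn, YEARLY_CSVS)
nyc_rows = fetch_nyc(conn)
```

Loading the raw rows first and filtering inside the database is what makes this ELT rather than ETL: the non-NYC row stays in `aqi_raw` and is simply excluded by the query.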
Three research questions structure the work. The first characterizes long-term AQI trends and seasonal patterns — including the 2023 spike above 200 linked to Canadian wildfires and the overall downward trend since the 1980s. The second asks whether LSTM deep learning can forecast future AQI values — it can, with decent accuracy for an inherently noisy environmental signal. The third compares five classification algorithms for predicting AQI category without using the AQI value itself — testing whether models can learn the structural patterns that produce each category.
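For the forecasting question, a daily AQI series has to be reframed as supervised examples before an LSTM can train on it. The sketch below shows one common way to do that, assuming a fixed lookback window and min-max scaling fitted on the training portion only. The helper names, the lookback of 3, and the sample values are hypothetical, and the code uses plain Python lists where the notebook would more likely use numpy arrays and scikit-learn's MinMaxScaler.

```python
def minmax_fit(values):
    """Return the (min, max) used for scaling; fit on training data only."""
    return min(values), max(values)


def minmax_apply(values, lo, hi):
    """Scale values to [0, 1] relative to the fitted range."""
    span = (hi - lo) or 1.0  # guard against a constant series
    return [(v - lo) / span for v in values]


def make_windows(series, lookback):
    """Turn a series into (X, y) pairs: `lookback` past days -> next day."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return X, y


# Invented daily AQI values, oldest first.
daily_aqi = [42, 55, 61, 48, 90, 73, 52, 47]

train_size = 6  # the first six days stand in for the training period
lo, hi = minmax_fit(daily_aqi[:train_size])
scaled = minmax_apply(daily_aqi, lo, hi)

X, y = make_windows(scaled, lookback=3)
```

Fitting the scaler on the training slice alone keeps information about later values out of the inputs, which matters for any time-ordered evaluation.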
The dataset contains 77,083 instances with no null values, spanning five counties and covering both routine pollution days and extreme events. The mean AQI of 51.8 sits just inside the EPA "Moderate" band (51 to 100), so NYC air quality has averaged at the low end of "Moderate" across the full period.
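To make the category reading of the mean concrete, here is a small sketch that maps an AQI value to its EPA category using the standard breakpoints (0 to 50 Good, 51 to 100 Moderate, and so on). The function name is hypothetical, and treating fractional values such as a mean as belonging to the next band once they exceed an upper bound is an assumption of this example.

```python
# Upper bounds of the standard EPA AQI categories, in ascending order.
AQI_BANDS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_category(aqi):
    """Return the EPA category name for a non-negative AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, name in AQI_BANDS:
        if aqi <= upper:
            return name
    return "Hazardous"  # anything above 300


mean_category = aqi_category(51.8)
```

With these breakpoints the dataset mean of 51.8 falls in "Moderate", less than two points above the top of "Good".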
Three questions structure the analysis — one exploratory, one forecasting, one classification. Each is addressed with appropriate methodology and evaluated against standard performance metrics for its problem type.
| Classifier | F1 Score | Accuracy | Notes |
|---|---|---|---|
| LightGBM | 0.59 | 0.58 | Best overall — precision/recall balance |
| KNN | 0.57 | 0.55 | Strong non-linear pattern capture |
| Random Forest | 0.57 | 0.53 | Comparable to KNN; ensemble robustness |
| SVC | 0.42 | 0.37 | Struggled with class imbalance |
| Logistic Regression | 0.33 | 0.35 | Underperformed on nonlinear relationships |
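Because the table reports both F1 and accuracy, it may help to see how the two can diverge when classes are imbalanced. The sketch below computes accuracy, macro-averaged F1, and support-weighted F1 from scratch on a tiny invented example. The write-up does not state which F1 averaging was used, so both variants are shown as possibilities rather than as the project's method, and the labels and predictions are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 for each label, computed as 2TP / (2TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by how often each class occurs."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Invented, imbalanced example: most days are "Good".
y_true = ["Good"] * 6 + ["Moderate"] * 3 + ["Unhealthy"]
y_pred = ["Good"] * 6 + ["Good", "Good", "Moderate"] + ["Good"]

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this toy case accuracy is 0.70 while macro F1 is about 0.43, because the classifier never predicts the rare "Unhealthy" class. That gap is the kind of effect to keep in mind when reading the SVC and Logistic Regression rows.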
Data was loaded and filtered in Python via mysql.connector, practicing real database-driven data acquisition at scale.
| Component | Details |
|---|---|
| Language | Python 3.13.5 · SQL |
| Environment | Jupyter Notebook 7.3.2 · MySQL Workbench 8.0 |
| Data | pandas · mysql.connector · numpy |
| EDA / Viz | matplotlib · seaborn |
| Deep Learning | Keras (LSTM) · TensorFlow · MinMaxScaler |
| ML | scikit-learn · LightGBM · GridSearchCV · TimeSeriesSplit |
| Dataset | EPA AQI Daily by County · 77,083 records · 5 NYC counties · 1980–2025 |
Full Jupyter Notebook and SQL schema available on GitHub.