A supervised text classification pipeline that trains on 3,879 labeled emails from the SpamAssassin corpus and validates on 50 real-world Gmail spam messages. Two models, Linear SVM and Logistic Regression, are trained on TF-IDF features (at most 2,000 tokens) and tuned with 5-fold cross-validation. The SVM reaches 99.0% holdout accuracy and generalizes far better to the Gmail messages than Logistic Regression (92% vs. 58% accuracy).
The core research question is whether a machine learning model trained on labeled historical email data can accurately classify new, unseen emails as spam or ham based purely on text content — without any rule-based keyword lists or manual filters.
This follows a fully supervised learning approach: labeled SpamAssassin emails build the classifier; the model is then deployed on 50 personal Gmail spam messages it has never seen — a realistic out-of-distribution evaluation that goes beyond the typical held-out split.
The central challenge is highly unstructured email data. Emails contain mixed HTML, quoted-printable encoded characters, system headers, forwarding artifacts, mailing list noise, and base64 blobs — all of which must be stripped without destroying the content words that carry classification signal.
A second challenge is pipeline consistency across two differently formatted datasets: folder-based .txt files (training) vs. a single MBOX file (test). Both require separate but comparable preprocessing pipelines producing identically structured TF-IDF feature matrices before any model sees the data.
Training data comes from the SpamAssassin public corpus —
a widely used benchmark for spam classification research. Each email is stored
as a raw .txt file in either a spam_2 or easy_ham
directory. Test data is a personal MBOX export of 50 Gmail spam emails,
providing a challenging real-world validation set with different noise
characteristics from the training corpus.
spam_2 (1,391 emails): Raw spam emails from the SpamAssassin corpus, labeled as spam. Contains promotional messages, phishing attempts, and automated bulk email with heavy HTML encoding and obfuscation techniques.
easy_ham (2,488 emails): Legitimate emails labeled as ham. Includes plain-text mailing list messages, personal correspondence, and technical discussion threads. Contains reply-quoting artifacts and list metadata noise.
Gmail MBOX (50 emails): Real Gmail spam emails exported as a single MBOX file. All are assumed to be spam (ground truth = spam). Contains quoted-printable encoded content, Google system headers, and modern commercial spam patterns.
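The two storage layouts described above call for two loaders. The following is a minimal base R sketch, not the project's actual code: `read_email_dir()` and `split_mbox()` are hypothetical helper names, and the temporary files only stand in for the real `spam_2` directory and Gmail export.

```r
# Read every file in a directory into one string per email (hypothetical helper).
read_email_dir <- function(dir, label) {
  files <- list.files(dir, full.names = TRUE)
  text <- vapply(files, function(f) {
    paste(readLines(f, encoding = "UTF-8", warn = FALSE), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  data.frame(text = text, label = rep(label, length(text)),
             stringsAsFactors = FALSE)
}

# Split an MBOX file into messages at lines that start with "From "
# (hypothetical helper).
split_mbox <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  starts <- grepl("^From ", lines)
  if (!any(starts)) return(character(0))
  msg_id <- cumsum(starts)          # 0 for any preamble, then 1, 2, ...
  keep <- msg_id > 0
  unname(vapply(split(lines[keep], msg_id[keep]),
                paste, character(1), collapse = "\n"))
}

# Tiny stand-in files so the sketch runs end to end.
tmp <- tempfile("emails")
dir.create(tmp)
spam_dir <- file.path(tmp, "spam_2")
dir.create(spam_dir)
writeLines(c("Subject: win", "", "Claim your prize now"),
           file.path(spam_dir, "0001.txt"))
mbox <- file.path(tmp, "gmail.mbox")
writeLines(c("From a@example.com Mon Jan 1", "Subject: one", "", "first body",
             "From b@example.com Tue Jan 2", "Subject: two", "", "second body"),
           mbox)

train_spam <- read_email_dir(spam_dir, "spam")
gmail_msgs <- split_mbox(mbox)
```

Both loaders return one string per message, so the same downstream cleaning functions can be applied to either source.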
The training corpus is split 80/20, stratified by label
(initial_split). 5-fold cross-validation
(vfold_cv) on the training split supports robust
hyperparameter tuning before the final holdout evaluation.
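The split and resampling scheme can be sketched with rsample as follows. The toy data frame, the seed, and the object names are assumptions for illustration; only `initial_split()`, `vfold_cv()`, the 80/20 proportion, and the 5 folds come from the description above.

```r
library(rsample)

set.seed(607)  # arbitrary seed, used here only to make the sketch repeatable

# Toy stand-in for the labeled corpus (the real data frame is not shown).
emails <- data.frame(
  text  = paste("message", 1:100),
  label = factor(rep(c("spam", "ham"), times = c(36, 64)))
)

# 80/20 split, stratified so both parts keep a similar spam/ham ratio.
email_split <- initial_split(emails, prop = 0.8, strata = label)
train <- training(email_split)
test  <- testing(email_split)

# 5 stratified folds on the training part only; the test part stays untouched.
folds <- vfold_cv(train, v = 5, strata = label)
```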
Raw emails require extensive preprocessing before any model can learn from them.
Two separate pipelines are implemented — one for the folder-based training corpus
and one for the MBOX test file — both producing identically structured cleaned
text columns. The MBOX pipeline adds a quoted-printable decoder
(decode_qp_safe()) not needed for the training data.
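The source of decode_qp_safe() is not shown here, so the following is only a sketch of what such a decoder might look like, built on the same rawToChar(as.raw(strtoi(...))) idea. The name `decode_qp_sketch()` is hypothetical, and replacing NUL and non-ASCII bytes with a space is an assumption made to keep the function safe on arbitrary input.

```r
# Hypothetical quoted-printable decoder; not the project's decode_qp_safe().
decode_qp_sketch <- function(x) {
  decode_one <- function(s) {
    if (is.na(s)) return(NA_character_)
    # Soft line breaks: "=" immediately followed by CRLF or LF.
    s <- gsub("=\r?\n", "", s, perl = TRUE)
    # Locate every =XX hex escape.
    m <- gregexpr("=[0-9A-Fa-f]{2}", s, perl = TRUE)
    hits <- regmatches(s, m)[[1]]
    if (length(hits) == 0) return(s)
    decoded <- vapply(hits, function(h) {
      code <- strtoi(substr(h, 2, 3), base = 16L)
      # Assumption: NUL and bytes >= 0x80 become a space instead of being
      # decoded, which avoids embedded-nul errors and invalid multibyte text.
      if (code == 0L || code >= 128L) " " else rawToChar(as.raw(code))
    }, character(1), USE.NAMES = FALSE)
    regmatches(s, m) <- list(decoded)
    s
  }
  vapply(x, decode_one, character(1), USE.NAMES = FALSE)
}

decoded <- decode_qp_sketch(c("Save 50=25 today", "hel=\r\nlo", NA))
```

Because a later cleaning step removes non-ASCII characters anyway, dropping high bytes at this stage should lose little classification signal, though a decoder meant for display would need to reassemble multi-byte UTF-8 sequences instead.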
Training: list.files() + readLines() per file, UTF-8 encoding. Test: single readLines() on the MBOX file, then split on "^From " delimiters to isolate individual messages.
Each message is split at the first blank line ("\n\n"). Only the second part (body) is retained. Email headers (From, Subject, Content-Type, Received) are discarded entirely.
The MBOX messages contain quoted-printable encoding (=XX hex sequences, =\r\n soft line breaks). A custom decode_qp_safe() function converts hex sequences to characters using rawToChar(as.raw(strtoi(...))).
Regex removal of HTML tags (<[^>]+>), URLs (http\S+|www\S+), and email addresses ([[:alnum:]._%+-]+@[[:alnum:].-]+). Matches are replaced with single spaces to preserve word boundaries.
Removal of HTML residue tokens (nbsp|href|font|table|td|tr|img|meta|html|style|align), encoded segments (==.*?==), and reply-quoting markers (^>+\s*) that are heavy in ham emails.
Normalization: str_to_lower(), remove non-ASCII characters, remove digit sequences, remove all punctuation. Each step replaces matched content with a space rather than an empty string to preserve word separation.
Feature extraction: step_tokenize() → step_stopwords() → step_tokenfilter(max_tokens=2000) → step_tfidf(). Applied identically to training and test sets via the same fitted recipe object.
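The body extraction and regex cleaning steps can be sketched in base R as below. The patterns are the ones quoted above, but several details are assumptions: the helper names `extract_body()` and `clean_text()`, base `gsub()` in place of stringr, the `\\b` word boundaries around the HTML residue tokens (so that "tr" does not cut into words such as "training"), the empty string returned when no blank line is found, and the final whitespace collapse.

```r
# Keep only the text after the first blank line (hypothetical helper).
extract_body <- function(raw) {
  raw <- gsub("\r\n", "\n", raw, fixed = TRUE)
  pos <- as.integer(regexpr("\n\n", raw, fixed = TRUE))
  # Assumption: a message without a blank line has no body.
  ifelse(pos > 0, substring(raw, pos + 2), "")
}

# Apply the cleaning regexes; every match becomes a space (hypothetical helper).
clean_text <- function(x) {
  x <- gsub("<[^>]+>", " ", x, perl = TRUE)                # HTML tags
  x <- gsub("http\\S+|www\\S+", " ", x, perl = TRUE)       # URLs
  x <- gsub("[[:alnum:]._%+-]+@[[:alnum:].-]+", " ", x)    # email addresses
  x <- gsub("(?m)^>+\\s*", " ", x, perl = TRUE)            # reply-quoting markers
  x <- gsub("==.*?==", " ", x, perl = TRUE)                # encoded segments
  x <- tolower(x)
  x <- gsub("\\b(nbsp|href|font|table|td|tr|img|meta|html|style|align)\\b",
            " ", x, perl = TRUE)                           # HTML residue tokens
  x <- gsub("[^\\x01-\\x7F]+", " ", x, perl = TRUE)        # non-ASCII characters
  x <- gsub("[0-9]+", " ", x)                              # digit sequences
  x <- gsub("[[:punct:]]+", " ", x)                        # punctuation
  trimws(gsub("\\s+", " ", x, perl = TRUE))                # collapse whitespace
}

raw_msg <- paste0("Subject: hi\r\nFrom: a@example.com\r\n\r\n",
                  "<p>Visit http://spam.example/win NOW</p>\n",
                  "> quoted reply 123 from bob@example.com!")
body    <- extract_body(raw_msg)
cleaned <- clean_text(body)
```

Replacing with a space rather than deleting keeps neighbouring words from fusing, for example when a tag sits directly between two words.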
Two supervised classification algorithms are tuned, compared via
5-fold cross-validation, and evaluated on a held-out test split before
the final external validation on real Gmail spam. Both use the same
TF-IDF recipe for feature consistency. Cost and penalty hyperparameters
are tuned via tune_grid(), and the best configuration is selected
by accuracy (SVM) or ROC-AUC (Logistic Regression).
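A shared TF-IDF recipe of the kind described above might look like the following textrecipes sketch. The toy data frames and object names are assumptions; the four steps and max_tokens = 2000 are taken from the text. With so few documents, textrecipes may note that fewer than 2,000 tokens were available, which is expected here.

```r
library(recipes)
library(textrecipes)

# Toy stand-ins for the cleaned training text and a new message.
train <- data.frame(
  text  = c("win a free prize now", "meeting notes for the project",
            "free money claim your prize", "patch for the kernel list"),
  label = factor(c("spam", "ham", "spam", "ham"))
)
new_mail <- data.frame(
  text  = "claim a free prize",
  label = factor("spam", levels = c("ham", "spam"))
)

tfidf_rec <- recipe(label ~ text, data = train) |>
  step_tokenize(text) |>
  step_stopwords(text) |>
  step_tokenfilter(text, max_tokens = 2000) |>
  step_tfidf(text)

# Vocabulary and IDF weights are estimated once, on the training data only.
fitted_rec <- prep(tfidf_rec, training = train)

train_x <- bake(fitted_rec, new_data = NULL)      # training features
new_x   <- bake(fitted_rec, new_data = new_mail)  # same columns for new text
```

Baking new messages with the already fitted recipe is what keeps the feature columns identical across the two datasets: words unseen during training simply get no column.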
A linear Support Vector Machine tuned over 8 cost values
(cost(range=c(-3,2))), selected by highest cross-validation
accuracy. It achieves the better balance of precision and recall of the
two models and is well suited to high-dimensional sparse TF-IDF features.
The linear kernel and regularisation via cost make it robust to overfitting.
It outperforms Logistic Regression on both the SpamAssassin holdout split and the real-world Gmail test data.
LASSO-regularized logistic regression tuned over 10 penalty values
(penalty(range=c(-4,0))). The best model is selected by ROC-AUC.
It achieves 97.3% holdout accuracy and is strong in cross-validation, but
falls to roughly 58% accuracy when applied to real Gmail spam, indicating
sensitivity to distributional shift between training and real-world data.
It provides probability estimates useful for threshold analysis.
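The two model specifications and their tuning grids could be written roughly as below. This is a sketch under assumptions: the text gives only the ranges and the number of candidate values, so the use of `grid_regular()` and all object names are illustrative. In dials, `cost()` is defined on a log2 scale and `penalty()` on a log10 scale, so the ranges are exponents rather than raw values.

```r
library(tidymodels)

# Linear SVM with a tunable cost (LiblineaR engine).
svm_spec <- svm_linear(cost = tune()) |>
  set_engine("LiblineaR") |>
  set_mode("classification")

# LASSO logistic regression: mixture = 1 selects a pure L1 penalty (glmnet).
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")

# Assumed regular grids: 8 costs from 2^-3 to 2^2, 10 penalties from 10^-4 to 1.
svm_grid   <- grid_regular(cost(range = c(-3, 2)), levels = 8)
lasso_grid <- grid_regular(penalty(range = c(-4, 0)), levels = 10)
```

Each specification would then be combined with the recipe in a workflow() and passed to tune_grid() together with the cross-validation folds and the matching grid.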
On the 20% holdout split of the SpamAssassin corpus, Linear SVM achieves accuracy 0.990, precision 0.985, recall 0.996, and F1 0.992. Almost no spam slips through and almost no ham is mislabeled.
Applied to 50 real Gmail spam emails (ground truth = all spam), SVM maintains 92% accuracy (46 of 50 flagged as spam), demonstrating robustness to distributional shift. Logistic Regression drops to 58% (29 of 50) with unstable precision/recall, suggesting it overfit to the SpamAssassin format rather than learning general spam patterns.
| Model | Holdout Accuracy | Holdout Precision | Holdout Recall | Holdout F1-Score | Real-World Accuracy (Gmail) |
|---|---|---|---|---|---|
| Linear SVM | 0.990 | 0.985 | 0.996 | 0.992 | 0.920 |
| Logistic Regression | 0.973 | 0.968 | 0.979 | 0.974 | 0.580 |
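For reference, the columns in this table follow the usual confusion-matrix definitions with spam as the positive class. The helper below is a hypothetical base R sketch, not the project's evaluation code (which presumably relies on tidymodels metrics). The example vectors reproduce only the 46-of-50 count implied by the SVM's 92% real-world accuracy.

```r
# Accuracy, precision, recall and F1 with "spam" as the positive class
# (hypothetical helper).
classification_metrics <- function(truth, pred, positive = "spam") {
  tp <- sum(truth == positive & pred == positive)
  fp <- sum(truth != positive & pred == positive)
  fn <- sum(truth == positive & pred != positive)
  tn <- sum(truth != positive & pred != positive)
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(accuracy = (tp + tn) / length(truth),
    precision = precision, recall = recall, f1 = f1)
}

# External Gmail set: every message is spam, and 46 of 50 are flagged.
gmail_truth <- rep("spam", 50)
gmail_pred  <- rep(c("spam", "ham"), times = c(46, 4))
gmail <- classification_metrics(gmail_truth, gmail_pred)
```

On a test set that contains only spam, accuracy and recall are the same number, and there can be no false positives, so the real-world column is best read as a spam detection rate.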
| Metric | CV Mean | Std Err | Interpretation |
|---|---|---|---|
| Accuracy | ~0.992 | <0.002 | Extremely stable across folds |
| Precision | ~0.990 | <0.003 | Very low false positive rate |
| Recall | ~0.996 | <0.002 | Near-zero missed spam |
| F1-Score | ~0.993 | <0.002 | Balanced precision/recall |
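The CV Mean and Std Err columns summarise the per-fold metric values: the mean across the 5 folds and the standard deviation divided by the square root of the number of folds, which is how tune's collect_metrics() reports them. The fold values below are hypothetical numbers chosen only to illustrate the arithmetic; they are not the project's actual fold results.

```r
# Hypothetical per-fold accuracies from a 5-fold cross-validation run.
fold_accuracy <- c(0.990, 0.994, 0.992, 0.989, 0.995)

cv_mean    <- mean(fold_accuracy)
cv_std_err <- sd(fold_accuracy) / sqrt(length(fold_accuracy))
```

A standard error this small relative to the mean is what supports reading the metric as stable across folds.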
A custom decode_qp_safe() quoted-printable decoder that
handles hex sequences vector-safely using rawToChar(as.raw(strtoi(...))).
Hyperparameter tuning via tune_grid() with stratified 5-fold cross-validation, with the
best configurations selected by accuracy (SVM) and ROC-AUC (LogReg).
workflow() objects combining recipe + model spec, enabling
clean refit and prediction on new data. All steps are documented in Quarto
with full reproducibility and published on RPubs.
| Component | Details |
|---|---|
| Language | R · Quarto |
| Packages | tidyverse · tidymodels · textrecipes · glmnet · LiblineaR · stringr · stringi · tm · stopwords |
| Training Data | SpamAssassin corpus — spam_2 (1,391) + easy_ham (2,488) = 3,879 emails |
| Test Data | Personal Gmail MBOX export — 50 real-world spam emails |
| Features | TF-IDF · max_tokens=2000 · stopwords removed · step_tokenfilter |
| Models | Linear SVM (LiblineaR) · LASSO Logistic Regression (glmnet) |
| Validation | 80/20 stratified split · 5-fold CV · external real-world MBOX test |
| Course | DATA 607 · Data Acquisition and Management · CUNY SPS |
Full Quarto report on RPubs · source code and datasets on GitHub.