R · NLP · Supervised Learning · CUNY DATA 607

Document Classification —
Spam Email Detection with SVM & Logistic Regression

A supervised text classification pipeline that trains on 3,879 labeled emails from the SpamAssassin corpus and validates on 50 real-world Gmail spam messages. Two models, a Linear SVM and a Logistic Regression, are trained on TF-IDF features (at most 2,000 tokens) and tuned with 5-fold cross-validation. The SVM reaches 99.0% holdout accuracy and generalizes far better to the real-world set (92% vs. 58% of Gmail spam correctly flagged).

R · tidymodels · TF-IDF · Linear SVM · Logistic Regression · SpamAssassin · NLP · CUNY DATA 607

Language

R · Quarto

Training Data

SpamAssassin corpus · 3,879 emails

Test Data

Personal Gmail MBOX · 50 real spam

Models

Linear SVM (LiblineaR) · Logistic Regression (glmnet)

Best Result

SVM: 99.0% holdout · 92% real-world

Course

DATA 607 · CUNY SPS · Project 4

Links

GitHub Repo ↗ RPubs Report ↗

01 About

The core research question is whether a machine learning model trained on labeled historical email data can accurately classify new, unseen emails as spam or ham based purely on text content — without any rule-based keyword lists or manual filters.

This follows a fully supervised learning approach: labeled SpamAssassin emails build the classifier; the model is then deployed on 50 personal Gmail spam messages it has never seen — a realistic out-of-distribution evaluation that goes beyond the typical held-out split.

The central challenge is highly unstructured email data. Emails contain mixed HTML, quoted-printable encoded characters, system headers, forwarding artifacts, mailing list noise, and base64 blobs — all of which must be stripped without destroying the content words that carry classification signal.

A second challenge is pipeline consistency across two differently formatted datasets: folder-based .txt files (training) vs. a single MBOX file (test). Both require separate but comparable preprocessing pipelines producing identically structured TF-IDF feature matrices before any model sees the data.

02 Dataset

Training data comes from the SpamAssassin public corpus — a widely used benchmark for spam classification research. Each email is stored as a raw .txt file in either a spam_2 or easy_ham directory. Test data is a personal MBOX export of 50 Gmail spam emails, providing a challenging real-world validation set with different noise characteristics from the training corpus.

Training — Spam

SpamAssassin spam_2

1,391

Raw spam emails from the SpamAssassin corpus. Labeled as spam. Contains promotional messages, phishing attempts, and automated bulk email with heavy HTML encoding and obfuscation techniques.

Training — Ham

SpamAssassin easy_ham

2,488

Legitimate emails labeled as ham. Includes plain-text mailing list messages, personal correspondence, and technical discussion threads. Contains reply-quoting artifacts and list metadata noise.

Test — Real-World Spam

Personal Gmail MBOX

Real Gmail spam emails exported as a single MBOX file. All assumed spam (ground truth = spam). Contains quoted-printable encoded content, Google system headers, and modern commercial spam patterns.

Split & Validation

80 / 20 Stratified + 5-Fold CV

3,879

Training corpus split 80/20 stratified by label (initial_split). 5-fold cross-validation (vfold_cv) on the training split for robust hyperparameter tuning before final holdout evaluation.
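
A minimal sketch of the split and resampling setup described above, using rsample. The data frame `emails`, its toy contents, and the seed are placeholders invented for illustration; only initial_split(), vfold_cv(), the 80/20 proportion, the 5 folds, and stratification by label come from the description.

```r
library(rsample)

set.seed(607)  # arbitrary seed, chosen only so the sketch is repeatable

# Toy stand-in for the cleaned corpus (hypothetical data, not the real emails).
emails <- data.frame(
  text  = paste("email", seq_len(100)),
  label = factor(rep(c("spam", "ham"), times = c(36, 64))),
  stringsAsFactors = FALSE
)

# 80/20 split, stratified so both parts keep a similar spam/ham ratio.
email_split <- initial_split(emails, prop = 0.8, strata = label)
train_data  <- training(email_split)
test_data   <- testing(email_split)

# 5-fold cross-validation on the training part only.
folds <- vfold_cv(train_data, v = 5, strata = label)
```

The holdout part (`test_data` here) is not touched during tuning, so the final holdout metrics are not biased by the hyperparameter search.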

03 Pipeline

Raw emails require extensive preprocessing before any model can learn from them. Two separate pipelines are implemented — one for the folder-based training corpus and one for the MBOX test file — both producing identically structured cleaned text columns. The MBOX pipeline adds a quoted-printable decoder (decode_qp_safe()) not needed for the training data.

Load raw emails from disk

Training: list.files() + readLines() per file, UTF-8 encoding. Test: single readLines() on the MBOX file, then split on "^From " delimiters to isolate individual messages.

map_chr(files, ~paste(readLines(.), collapse="\n"))
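
The two loading paths could look roughly like the sketch below, written in base R. The helper names `read_email_dir()` and `split_mbox()` are hypothetical and not taken from the project; only list.files(), readLines(), and the "^From " delimiter come from the step description.

```r
# Hypothetical helpers for illustration; not the author's code.

# Read every file in a directory into one string per email.
read_email_dir <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE, encoding = "UTF-8"), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
}

# Group the lines of an MBOX file into messages; a new message starts
# at every line that begins with "From ".
split_mbox <- function(lines) {
  starts <- grepl("^From ", lines)
  if (!any(starts)) return(character(0))
  msg_id <- cumsum(starts)
  keep   <- msg_id > 0  # drop anything before the first delimiter
  unname(vapply(split(lines[keep], msg_id[keep]), paste, character(1), collapse = "\n"))
}
```

With these helpers, `split_mbox(readLines("spam.mbox"))` would return one string per message, where the file name is a placeholder.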

Extract email body (strip headers)

Each email is split on the first double newline ("\n\n"). Only the second part (body) is retained. Email headers — From, Subject, Content-Type, Received — are discarded entirely.

str_split(x, "\n\n", n=2, simplify=TRUE)[2]
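
A vectorised variant of the same idea, as a sketch: `extract_body()` is a hypothetical name, and str_split_fixed() is used here instead of the one-liner above so that messages without a blank line yield an empty body rather than a missing value.

```r
library(stringr)

# Return everything after the first blank line of each message.
# str_split_fixed() always returns a 2-column matrix, padding with "".
extract_body <- function(x) {
  parts <- str_split_fixed(x, "\n\n", n = 2)
  parts[, 2]
}

extract_body("Subject: hi\nFrom: a@example.com\n\nHello\n\nworld")
```

Only the first blank line is treated as the header boundary, so blank lines inside the body are preserved.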

Decode quoted-printable encoding (MBOX only)

MBOX emails use quoted-printable encoding (=XX hex sequences, =\r\n soft line breaks). A custom decode_qp_safe() function converts hex sequences to characters using rawToChar(as.raw(strtoi(...))).

str_replace_all(x, "=[0-9A-Fa-f]{2}", decode_fn)
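
The project's decode_qp_safe() is not reproduced on this page, so the following is only a sketch of how such a decoder might work. The name `decode_qp_sketch()` and the choice to replace non-printable or non-ASCII bytes with a space are assumptions made here; multi-byte UTF-8 sequences such as =C3=A9 are therefore not reconstructed.

```r
library(stringr)

# Illustrative decoder; the author's decode_qp_safe() may differ in detail.
decode_qp_sketch <- function(x) {
  # 1. Remove soft line breaks: "=" at the end of a line joins it to the next.
  x <- str_replace_all(x, "=\r?\n", "")

  # 2. Convert one "=XX" match to its character. Bytes outside printable
  #    ASCII become a space so rawToChar() never sees NUL or partial UTF-8.
  decode_one <- function(m) {
    code <- strtoi(str_sub(m, 2, 3), base = 16L)
    if (is.na(code) || code < 32 || code > 126) return(" ")
    rawToChar(as.raw(code))
  }

  # The replacement function may receive one match or a vector of matches,
  # depending on the stringr version, so it is applied element-wise.
  str_replace_all(x, "=[0-9A-Fa-f]{2}", function(matches) {
    vapply(matches, decode_one, character(1), USE.NAMES = FALSE)
  })
}

decode_qp_sketch("Save 50=25 today=\n only")
```

Dropping non-ASCII bytes loses little in this pipeline, because a later cleaning step removes non-ASCII characters anyway.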

Remove HTML, URLs, and email addresses

Strip all HTML tags (<[^>]+>), URLs (http\S+|www\S+), and email addresses ([[:alnum:]._%+-]+@[[:alnum:].-]+). Replaced with single spaces to preserve word boundaries.
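
As a sketch, the three patterns can be chained in one helper. The name `strip_markup()` is hypothetical, the hyphens inside the character classes are escaped here as a precaution, and the final str_squish() is an addition made only to keep the output tidy.

```r
library(stringr)

# Replace tags, URLs, and email addresses with a space so neighbouring
# words stay separated; patterns follow the ones listed in the step above.
strip_markup <- function(x) {
  x %>%
    str_replace_all("<[^>]+>", " ") %>%                              # HTML tags
    str_replace_all("http\\S+|www\\S+", " ") %>%                     # URLs
    str_replace_all("[[:alnum:]._%+\\-]+@[[:alnum:].\\-]+", " ") %>% # email addresses
    str_squish()                                                     # collapse repeated spaces
}

strip_markup("<b>Buy</b> now at http://example.com or mail sales@example.com")
```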

Remove HTML/CSS tokens, mailing list artifacts, reply markers

Lowercase, remove punctuation and digits

str_to_lower(), remove non-ASCII characters, remove digit sequences, remove all punctuation. Each step replaces matched content with a space rather than empty string to preserve word separation.

str_to_lower() %>% str_replace_all("[[:digit:]]+", " ")
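
A fuller sketch of this normalisation step is shown below. `normalise_text()` is a hypothetical name, and the exact patterns are assumptions: in stringr's ICU regex engine the punctuation class does not cover symbols such as `$` or `+`, so the Unicode symbol category is included here as well.

```r
library(stringr)

# Lowercase, then replace non-ASCII characters, digit runs, and
# punctuation/symbols with a space to keep words separated.
normalise_text <- function(x) {
  x %>%
    str_to_lower() %>%
    str_replace_all("[^\\x00-\\x7F]+", " ") %>%   # non-ASCII characters
    str_replace_all("[[:digit:]]+", " ") %>%      # digit sequences
    str_replace_all("[\\p{P}\\p{S}]+", " ") %>%   # punctuation and symbols
    str_squish()                                  # added here: collapse whitespace
}

normalise_text("FREE!!! Win $1000 today")
```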

TF-IDF feature extraction (max 2,000 tokens)

textrecipes pipeline: step_tokenize() → step_stopwords() → step_tokenfilter(max_tokens=2000) → step_tfidf(). Applied identically to training and test sets via the same fitted recipe object.

recipe(label ~ text, data = train) %>% step_tokenize(text) %>% step_tfidf(text)
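
The full four-step recipe might be written as in the sketch below. The toy data frame and object names are invented for illustration; the step functions and max_tokens = 2000 are the ones named above. Preparing the recipe once and baking both data sets with it is what keeps the feature columns identical.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data standing in for the cleaned emails.
toy <- data.frame(
  text  = c("win free money now", "meeting notes attached",
            "free prize claim now", "lunch tomorrow at noon"),
  label = factor(c("spam", "ham", "spam", "ham")),
  stringsAsFactors = FALSE
)

tfidf_rec <- recipe(label ~ text, data = toy) %>%
  step_tokenize(text) %>%                       # split text into word tokens
  step_stopwords(text) %>%                      # drop common English stopwords
  step_tokenfilter(text, max_tokens = 2000) %>% # keep the most frequent tokens
  step_tfidf(text)                              # one TF-IDF column per token

# Learn the vocabulary and IDF weights from the training data only.
fitted_rec <- prep(tfidf_rec, training = toy)
train_x <- bake(fitted_rec, new_data = NULL)

# New emails are projected onto the same columns; unseen words are ignored.
new_emails <- data.frame(
  text  = "claim your free money",
  label = factor("spam", levels = c("ham", "spam")),
  stringsAsFactors = FALSE
)
new_x <- bake(fitted_rec, new_data = new_emails)
```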

Filter short/empty emails

Emails with fewer than 5 whitespace-separated tokens after cleaning are dropped. Removes empty shells, single-word remnants, and emails whose bodies were entirely encoding artifacts.

filter(str_count(text, "\\S+") >= 5)
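
Applied to a small made-up tibble, the filter behaves as follows; the example rows are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical cleaned emails: two are too short to keep.
emails <- tibble(
  text = c("", "unsubscribe", "click here to claim your free prize", "a b c d e")
)

# Count runs of non-whitespace characters and keep rows with at least 5.
kept <- emails %>%
  filter(str_count(text, "\\S+") >= 5)
```

The empty string and the single-word email are dropped, while the five-token row sits exactly on the threshold and is kept.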

04 Models

Two supervised classification algorithms are tuned, compared via 5-fold cross-validation, and evaluated on a held-out test split before the final external validation on real Gmail spam. Both use the same TF-IDF recipe for feature consistency. The SVM cost and the Logistic Regression penalty are tuned via tune_grid(); the best SVM configuration is selected by accuracy and the best Logistic Regression configuration by ROC-AUC.
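
The two model specifications and their search grids could be set up as in this sketch. Using grid_regular() to generate 8 and 10 candidate values is an assumption; the page states the ranges and the number of values but not how the grids were built. Note that dials defines cost on a log2 scale and penalty on a log10 scale, so the ranges translate to 0.125 to 4 and 0.0001 to 1.

```r
library(tidymodels)

# Specifications as named on this page; tune() marks the value to search over.
svm_spec <- svm_linear(cost = tune()) %>%
  set_engine("LiblineaR") %>%
  set_mode("classification")

logreg_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%  # mixture = 1 is LASSO
  set_engine("glmnet") %>%
  set_mode("classification")

# Candidate values (assumed to be evenly spaced on the transformed scale).
cost_grid    <- grid_regular(cost(range = c(-3, 2)), levels = 8)      # 2^-3 ... 2^2
penalty_grid <- grid_regular(penalty(range = c(-4, 0)), levels = 10)  # 10^-4 ... 10^0
```

Each grid would then be passed to tune_grid() together with a workflow and the cross-validation folds.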

Model 01 — Winner

Linear SVM

svm_linear(cost=tune()) · LiblineaR engine

A linear Support Vector Machine tuned over 8 cost values (cost(range=c(-3,2))), with the best configuration selected by highest cross-validation accuracy. Linear SVMs are well suited to high-dimensional, sparse TF-IDF features, and the cost parameter provides regularisation that limits overfitting. Of the two models, it gives the better balance of precision and recall and outperforms Logistic Regression on both the holdout split and the real-world Gmail data.

★ Best Model — 99.0% holdout · 92% Gmail

Model 02

Logistic Regression

logistic_reg(penalty=tune(), mixture=1) · glmnet

LASSO-regularized logistic regression tuned over 10 penalty values (penalty(range=c(-4,0))), with the best model selected by ROC-AUC. Achieves 97.3% holdout accuracy, but accuracy falls to about 58% on real Gmail spam, indicating sensitivity to distributional shift between the training corpus and real-world data. Provides probability estimates useful for threshold analysis.
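
One way such a threshold analysis might look is sketched below in base R. The function name `flag_rate()` and the probabilities are made up for illustration and are not results from the project.

```r
# Share of an all-spam set that would be flagged at each probability threshold.
flag_rate <- function(p_spam, thresholds = c(0.25, 0.5, 0.75)) {
  data.frame(
    threshold = thresholds,
    flagged   = vapply(thresholds, function(t) mean(p_spam >= t), numeric(1))
  )
}

# Made-up predicted spam probabilities for five emails (illustration only).
p_spam <- c(0.95, 0.80, 0.62, 0.45, 0.30)
flag_rate(p_spam)
```

Lowering the threshold flags more of the spam, at the cost of more false positives on legitimate mail, which an all-spam test set cannot show.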

05 Results

Performance Summary

Holdout Test Set — Linear SVM

99.0% accuracy · 99.6% recall

On the 20% holdout split of the SpamAssassin corpus, Linear SVM achieves accuracy 0.990, precision 0.985, recall 0.996, and F1 0.992. Almost no spam slips through and almost no ham is mislabeled.

Real-World Gmail Spam — Both Models

SVM 92% stable · LogReg 58% unstable

Applied to 50 real Gmail spam emails (ground truth = all spam), the SVM maintains 92% accuracy, demonstrating robustness to distributional shift. Logistic Regression drops to 58%, which suggests it fit the SpamAssassin format rather than general spam patterns. Because every message in this set is spam, accuracy here equals the share of messages flagged as spam (spam recall); the set cannot measure false positives on ham.

Holdout Test Set — Full Metrics Comparison

Model	Accuracy	Precision	Recall	F1-Score	Real-World Acc.
Linear SVM	0.990	0.985	0.996	0.992	0.920
Logistic Regression	0.973	0.968	0.979	0.974	0.580
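
For reference, the four holdout metrics in the table follow from confusion-matrix counts as in this sketch. The function name and the counts are hypothetical; the project's actual confusion matrices are not shown on this page.

```r
# Metrics from confusion-matrix counts, with spam as the positive class.
classification_metrics <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)  # of emails flagged as spam, share that are spam
  recall    <- tp / (tp + fn)  # of true spam, share that was flagged
  c(
    accuracy  = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall)
  )
}

# Hypothetical counts for illustration only.
classification_metrics(tp = 8, fp = 2, fn = 0, tn = 10)
```

On a test set that contains only spam, fp and tn are both zero, so accuracy reduces to tp / (tp + fn), which is the recall.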

Cross-Validation Results (5-Fold) — Linear SVM

Metric	CV Mean	Std Err	Interpretation
Accuracy	~0.992	<0.002	Extremely stable across folds
Precision	~0.990	<0.003	Very low false positive rate
Recall	~0.996	<0.002	Near-zero missed spam
F1-Score	~0.993	<0.002	Balanced precision/recall

06 Contributions

01 Dual-source preprocessing pipelines — Built two separate but comparable cleaning pipelines for folder-based .txt emails and MBOX format, including a custom decode_qp_safe() quoted-printable decoder that handles hex sequences vector-safely using rawToChar(as.raw(strtoi(...))).
02 TF-IDF feature engineering with textrecipes — Implemented a full tidymodels recipe: tokenization, stopword removal, token frequency filtering (max 2,000), and TF-IDF weighting — fitted on training data and applied consistently to both holdout and real-world test sets.
03 Hyperparameter tuning via 5-fold CV — Tuned Linear SVM over 8 cost values and Logistic Regression over 10 penalty values using tune_grid() with stratified 5-fold cross-validation. Selected best configurations by accuracy (SVM) and ROC-AUC (LogReg).
04 Real-world external validation — Evaluated both models on 50 personal Gmail spam emails as a true out-of-distribution test, revealing a significant generalization gap for Logistic Regression (~58% accuracy) against the SVM's stability (~92%). This result is not visible in the in-distribution holdout split alone.
05 End-to-end tidymodels workflow — Built complete workflow() objects combining recipe + model spec, enabling clean refit and prediction on new data. All steps documented in Quarto with full reproducibility — published on RPubs.
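
As a rough sketch of the workflow idea in the last item, a recipe and a model specification can be bundled so that fitting and prediction always apply the same preprocessing. The toy data, the fixed cost = 1, and the helper `classify_new()` are assumptions for illustration; the helper is only defined here, since fitting requires the LiblineaR package and real training data.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy data standing in for the cleaned training emails.
toy <- tibble(
  text  = c("win free money now", "meeting notes attached",
            "free prize claim now", "lunch tomorrow at noon"),
  label = factor(c("spam", "ham", "spam", "ham"))
)

rec <- recipe(label ~ text, data = toy) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 2000) %>%
  step_tfidf(text)

# cost = 1 is a placeholder; in the project the value comes from tuning.
spec <- svm_linear(cost = 1) %>%
  set_engine("LiblineaR") %>%
  set_mode("classification")

# Bundle preprocessing and model into one object.
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

# Refit on training data, then classify new emails with the same recipe.
classify_new <- function(wf, train, new_emails) {
  fitted <- fit(wf, data = train)
  predict(fitted, new_data = new_emails)
}
```

Calling `classify_new(wf, train_data, gmail_data)` with real data frames would return one predicted class per new email; both data frame names are placeholders.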

Language	R · Quarto
Packages	tidyverse · tidymodels · textrecipes · glmnet · LiblineaR · stringr · stringi · tm · stopwords
Training Data	SpamAssassin corpus — spam_2 (1,391) + easy_ham (2,488) = 3,879 emails
Test Data	Personal Gmail MBOX export — 50 real-world spam emails
Features	TF-IDF · max_tokens=2000 · stopwords removed · step_tokenfilter
Models	Linear SVM (LiblineaR) · LASSO Logistic Regression (glmnet)
Validation	80/20 stratified split · 5-fold CV · external real-world MBOX test
Course	DATA 607 · Data Acquisition and Management · CUNY SPS

Read the full analysis

Full Quarto report on RPubs · source code and datasets on GitHub.
