R · NLP · Supervised Learning · CUNY DATA 607

Document Classification
Spam Email Detection with SVM & Logistic Regression

A supervised text classification pipeline that trains on 3,879 labeled emails from the SpamAssassin corpus and validates on 50 real-world Gmail spam messages. Two models, a Linear SVM and Logistic Regression, are trained on TF-IDF features (max 2,000 tokens) and tuned with 5-fold cross-validation. The SVM reaches 99.0% holdout accuracy and generalizes far better to the Gmail messages than Logistic Regression (92% vs. 58% accuracy).

R · tidymodels · TF-IDF · Linear SVM · Logistic Regression · SpamAssassin · NLP · CUNY DATA 607
Language
R · Quarto
Training Data
SpamAssassin corpus · 3,879 emails
Test Data
Personal Gmail MBOX · 50 real spam
Models
Linear SVM (LiblineaR) · Logistic Regression (glmnet)
Best Result
SVM: 99.0% holdout · 92% real-world
Course
DATA 607 · CUNY SPS · Project 4
01 About

The core research question is whether a machine learning model trained on labeled historical email data can accurately classify new, unseen emails as spam or ham based purely on text content — without any rule-based keyword lists or manual filters.

This follows a fully supervised learning approach: labeled SpamAssassin emails build the classifier; the model is then deployed on 50 personal Gmail spam messages it has never seen — a realistic out-of-distribution evaluation that goes beyond the typical held-out split.

The central challenge is highly unstructured email data. Emails contain mixed HTML, quoted-printable encoded characters, system headers, forwarding artifacts, mailing list noise, and base64 blobs — all of which must be stripped without destroying the content words that carry classification signal.

A second challenge is pipeline consistency across two differently formatted datasets: folder-based .txt files (training) vs. a single MBOX file (test). Both require separate but comparable preprocessing pipelines producing identically structured TF-IDF feature matrices before any model sees the data.

02 Dataset

Training data comes from the SpamAssassin public corpus — a widely used benchmark for spam classification research. Each email is stored as a raw .txt file in either a spam_2 or easy_ham directory. Test data is a personal MBOX export of 50 Gmail spam emails, providing a challenging real-world validation set with different noise characteristics from the training corpus.

Training — Spam
SpamAssassin spam_2
1,391

Raw spam emails from the SpamAssassin corpus. Labeled as spam. Contains promotional messages, phishing attempts, and automated bulk email with heavy HTML encoding and obfuscation techniques.

Training — Ham
SpamAssassin easy_ham
2,488

Legitimate emails labeled as ham. Includes plain-text mailing list messages, personal correspondence, and technical discussion threads. Contains reply-quoting artifacts and list metadata noise.

Test — Real-World Spam
Personal Gmail MBOX
50

Real Gmail spam emails exported as a single MBOX file. All assumed spam (ground truth = spam). Contains quoted-printable encoded content, Google system headers, and modern commercial spam patterns.

Split & Validation
80 / 20 Stratified + 5-Fold CV
3,879

Training corpus split 80/20 stratified by label (initial_split). 5-fold cross-validation (vfold_cv) on the training split for robust hyperparameter tuning before final holdout evaluation.
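The split and resampling setup described above could be sketched as follows. This is a minimal illustration, assuming a data frame with a factor column named label; the object names, the stand-in data, and the seed are hypothetical and not taken from the project.

```r
library(rsample)

# Stand-in data with the same class counts as the corpus (text column omitted).
emails <- data.frame(
  id    = seq_len(3879),
  label = factor(rep(c("spam", "ham"), times = c(1391, 2488)))
)

set.seed(607)  # arbitrary seed, used only to make the sketch reproducible

# 80/20 split that keeps the spam/ham ratio similar in both parts
email_split <- initial_split(emails, prop = 0.8, strata = label)
email_train <- training(email_split)
email_test  <- testing(email_split)

# 5-fold cross-validation on the training part only
email_folds <- vfold_cv(email_train, v = 5)
```

Resampling only the training part keeps the 20% holdout untouched until the final evaluation.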

03 Pipeline

Raw emails require extensive preprocessing before any model can learn from them. Two separate pipelines are implemented — one for the folder-based training corpus and one for the MBOX test file — both producing identically structured cleaned text columns. The MBOX pipeline adds a quoted-printable decoder (decode_qp_safe()) not needed for the training data.

01
Load raw emails from disk
Training: list.files() + readLines() per file, UTF-8 encoding. Test: single readLines() on the MBOX file, then split on "^From " delimiters to isolate individual messages.
map_chr(files, ~paste(readLines(.), collapse="\n"))
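For the MBOX side, splitting on "^From " delimiters might look like the sketch below. The helper name split_mbox() and the sample lines are hypothetical; the sketch also assumes that body lines beginning with "From " have been escaped by the exporter, which is common for MBOX files but not guaranteed.

```r
# Split a vector of MBOX lines into one string per message.
split_mbox <- function(lines) {
  starts <- grepl("^From ", lines)   # TRUE on each message delimiter line
  msg_id <- cumsum(starts)           # running message index for every line
  keep   <- msg_id > 0               # drop anything before the first delimiter
  msgs   <- split(lines[keep], msg_id[keep])
  unname(vapply(msgs, paste, character(1), collapse = "\n"))
}

# Illustrative input; a real run would start from readLines() on the MBOX file.
lines <- c(
  "From 123@xxx Mon Jan 01 00:00:00 2024",
  "Subject: Offer",
  "",
  "Buy now",
  "From 456@xxx Tue Jan 02 00:00:00 2024",
  "Subject: Prize",
  "",
  "Claim today"
)
messages <- split_mbox(lines)
length(messages)
```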
02
Extract email body (strip headers)
Each email is split on the first double newline ("\n\n"). Only the second part (body) is retained. Email headers — From, Subject, Content-Type, Received — are discarded entirely.
str_split(x, "\n\n", n=2, simplify=TRUE)[2]
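A vectorised variant of this step is sketched below. It assumes str_split_fixed() is an acceptable substitute for the call shown above, since it always returns exactly two columns, and it adds a CRLF normalisation step that the source does not mention. The helper name extract_body() is hypothetical.

```r
library(stringr)

# Return the text after the first blank line; "" when there is no blank line.
extract_body <- function(x) {
  x <- str_replace_all(x, "\r\n", "\n")   # normalise CRLF line endings (assumption)
  str_split_fixed(x, "\n\n", n = 2)[, 2]  # column 2 holds everything after the headers
}

raw <- c(
  "Subject: Hi\nFrom: a@example.com\n\nFirst line\n\nSecond paragraph",
  "Subject: header only"
)
bodies <- extract_body(raw)
```

Because n = 2 limits the split to the first blank line, later blank lines inside the body are preserved.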
03
Decode quoted-printable encoding (MBOX only)
MBOX emails use quoted-printable encoding (=XX hex sequences, =\r\n soft line breaks). A custom decode_qp_safe() function converts hex sequences to characters using rawToChar(as.raw(strtoi(...))).
str_replace_all(x, "=[0-9A-Fa-f]{2}", decode_fn)
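One way such a decoder could look is sketched below in base R. It is not the project's decode_qp_safe(); the name decode_qp_sketch() and the decision to decode whole runs of =XX together (so multi-byte UTF-8 characters survive) are assumptions made for illustration.

```r
decode_qp_sketch <- function(x) {
  # Soft line breaks: "=" at the end of a line joins it to the next line.
  x <- gsub("=\r?\n", "", x)

  # Decode one run such as "=C3=A9" into text.
  decode_run <- function(run) {
    hex   <- strsplit(run, "=", fixed = TRUE)[[1]][-1]
    bytes <- as.raw(strtoi(hex, base = 16L))
    # rawToChar() fails on embedded NUL bytes; fall back to a space.
    out <- tryCatch(rawToChar(bytes), error = function(e) NA_character_)
    if (is.na(out)) return(" ")
    # Drop byte sequences that are not valid UTF-8.
    iconv(out, from = "UTF-8", to = "UTF-8", sub = "")
  }

  # Replace every run of =XX sequences with its decoded text.
  m <- gregexpr("(=[0-9A-Fa-f]{2})+", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(runs) {
    vapply(runs, decode_run, character(1), USE.NAMES = FALSE)
  })
  x
}

decoded <- decode_qp_sketch("Caf=C3=A9 =\nau lait, 2+2=3D4")
```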
04
Remove HTML, URLs, and email addresses
Strip all HTML tags (<[^>]+>), URLs (http\S+|www\S+), and email addresses ([[:alnum:]._%+-]+@[[:alnum:].-]+). Replaced with single spaces to preserve word boundaries.
05
Remove HTML/CSS tokens, mailing list artifacts, reply markers
Strip specific HTML keywords (nbsp|href|font|table|td|tr|img|meta|html|style|align), encoded segments (==.*?==), and reply-quoting markers (^>+\s*), which are common in ham emails.
06
Lowercase, remove punctuation and digits
str_to_lower(), remove non-ASCII characters, remove digit sequences, remove all punctuation. Each step replaces matched content with a space rather than an empty string to preserve word separation.
str_to_lower() %>% str_replace_all("[[:digit:]]+", " ")
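Steps 04 to 06 could be combined into a single helper along the lines below. This is a sketch using the patterns quoted above, with a few assumptions: the helper name clean_email_text() is hypothetical, the HTML keyword pattern is wrapped in word boundaries and applied after lowercasing, and a final str_squish() is added to collapse the spaces left by the replacements. Note that in stringr, [[:punct:]] does not cover every symbol character (for example "$" or "+"), so a real pipeline may need an extra pattern for those.

```r
library(stringr)

clean_email_text <- function(x) {
  html_words <- "\\b(nbsp|href|font|table|td|tr|img|meta|html|style|align)\\b"
  x %>%
    str_replace_all("<[^>]+>", " ") %>%                            # HTML tags
    str_replace_all("http\\S+|www\\S+", " ") %>%                   # URLs
    str_replace_all("[[:alnum:]._%+-]+@[[:alnum:].-]+", " ") %>%   # email addresses
    str_replace_all(regex("^>+\\s*", multiline = TRUE), " ") %>%   # reply markers
    str_replace_all("==.*?==", " ") %>%                            # encoded segments
    str_to_lower() %>%
    str_replace_all(html_words, " ") %>%                           # leftover HTML/CSS words
    str_replace_all("[^\\x00-\\x7F]+", " ") %>%                    # non-ASCII characters
    str_replace_all("[[:digit:]]+", " ") %>%                       # digit runs
    str_replace_all("[[:punct:]]+", " ") %>%                       # punctuation
    str_squish()                                                   # collapse whitespace
}

sample_email <- "<p>Visit http://spam.example/win NOW!</p>\n> Quoted reply 123, mail me@example.org"
cleaned <- clean_email_text(sample_email)
```

Applying one shared helper to both the training corpus and the decoded MBOX messages is one way to keep the two pipelines comparable.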
07
TF-IDF feature extraction (max 2,000 tokens)
textrecipes pipeline: step_tokenize() → step_stopwords() → step_tokenfilter(max_tokens=2000) → step_tfidf(). Applied identically to training and test sets via the same fitted recipe object.
recipe(label ~ text) %>% step_tokenize() %>% step_tfidf()
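Written out with all four steps, the recipe might look like the sketch below. The toy data frame and object names are hypothetical; the point is that the recipe is prepped once on training data and the same fitted object is then used to bake new text, so both feature matrices share the same columns.

```r
library(tidymodels)
library(textrecipes)

# Toy stand-in for the cleaned training emails.
toy <- data.frame(
  text  = c("win free money now", "meeting notes attached for review",
            "free prize claim now", "lunch plans for friday"),
  label = factor(c("spam", "ham", "spam", "ham"))
)

spam_recipe <- recipe(label ~ text, data = toy) %>%
  step_tokenize(text) %>%                        # split text into word tokens
  step_stopwords(text) %>%                       # drop common stop words
  step_tokenfilter(text, max_tokens = 2000) %>%  # keep the most frequent tokens
  step_tfidf(text)                               # one tfidf_text_* column per token

fitted_recipe  <- prep(spam_recipe, training = toy)
train_features <- bake(fitted_recipe, new_data = NULL)

# New text is transformed with the vocabulary learned from the training data.
new_email <- data.frame(
  text  = "claim free money",
  label = factor("spam", levels = c("ham", "spam"))
)
new_features <- bake(fitted_recipe, new_data = new_email)
```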
08
Filter short/empty emails
Emails with fewer than 5 whitespace-separated tokens after cleaning are dropped. Removes empty shells, single-word remnants, and emails whose bodies were entirely encoding artifacts.
filter(str_count(text, "\\S+") >= 5)
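A small worked example of this filter, using a hypothetical four-row tibble, shows which rows survive the five-token threshold:

```r
library(dplyr)
library(stringr)

# Illustrative cleaned texts; only rows with at least 5 tokens are kept.
emails <- tibble(
  text = c(
    "claim your free prize today friend",  # 6 tokens, kept
    "ok",                                  # 1 token, dropped
    "",                                    # 0 tokens, dropped
    "see you at lunch tomorrow then"       # 6 tokens, kept
  )
)

kept <- emails %>% filter(str_count(text, "\\S+") >= 5)
n_dropped <- nrow(emails) - nrow(kept)
```

Tracking n_dropped for each dataset makes it easy to report how many emails the filter removed.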
04 Models

Two supervised classification algorithms are tuned with 5-fold cross-validation, compared, and evaluated on the held-out 20% split before the final external validation on real Gmail spam. Both use the same TF-IDF recipe for feature consistency. Hyperparameters are tuned via tune_grid(): cost for the SVM, with the best configuration selected by accuracy, and penalty for Logistic Regression, with the best configuration selected by ROC-AUC.

Model 01 — Winner
Linear SVM
svm_linear(cost=tune()) · LiblineaR engine

A linear Support Vector Machine tuned over 8 cost values (cost(range=c(-3,2))). The best configuration is selected by highest cross-validation accuracy. Of the two models, it achieves the better balance of precision and recall, and it is well suited to high-dimensional sparse TF-IDF features. The linear kernel and regularisation via cost limit overfitting. It outperforms Logistic Regression on both the SpamAssassin holdout split and the real-world Gmail test data.

★ Best Model — 99.0% holdout · 92% Gmail
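The SVM specification and its tuning grid could be set up as sketched below. Using grid_regular() to generate the 8 candidates is an assumption, as the source only states the range and the number of values. In dials, cost() is defined on a log2 scale, so range = c(-3, 2) corresponds to cost values from 0.125 to 4.

```r
library(tidymodels)

# Model specification with cost left open for tuning.
svm_spec <- svm_linear(cost = tune()) %>%
  set_mode("classification") %>%
  set_engine("LiblineaR")

# 8 candidate cost values, evenly spaced on the log2 scale.
svm_grid <- grid_regular(cost(range = c(-3, 2)), levels = 8)
svm_grid$cost

# A tuning call would then look roughly like this (object names hypothetical):
# tune_grid(svm_workflow, resamples = email_folds, grid = svm_grid,
#           metrics = metric_set(accuracy))
```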
Model 02
Logistic Regression
logistic_reg(penalty=tune(), mixture=1) · glmnet

LASSO-regularized logistic regression tuned over 10 penalty values (penalty(range=c(-4,0))), with the best model selected by ROC-AUC. It achieves 97.3% holdout accuracy but drops to roughly 58% accuracy when applied to real Gmail spam, indicating sensitivity to distributional shift between the training corpus and real-world data. Its probability estimates are useful for threshold analysis.
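A comparable sketch for the logistic regression side is shown below. As with the SVM, the use of grid_regular() is an assumption. In dials, penalty() is defined on a log10 scale, so range = c(-4, 0) spans penalties from 0.0001 to 1, and mixture = 1 selects a pure LASSO penalty in glmnet.

```r
library(tidymodels)

# LASSO logistic regression with the penalty left open for tuning.
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# 10 candidate penalties, evenly spaced on the log10 scale.
lasso_grid <- grid_regular(penalty(range = c(-4, 0)), levels = 10)
lasso_grid$penalty

# A tuning call would then look roughly like this (object names hypothetical):
# tune_grid(lasso_workflow, resamples = email_folds, grid = lasso_grid,
#           metrics = metric_set(roc_auc))
```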

05 Results
Performance Summary
Holdout Test Set — Linear SVM
99.0% accuracy · 99.6% recall

On the 20% holdout split of the SpamAssassin corpus, Linear SVM achieves accuracy 0.990, precision 0.985, recall 0.996, and F1 0.992. Almost no spam slips through and almost no ham is mislabeled.

Real-World Gmail Spam — Both Models
SVM 92% stable · LogReg 58% unstable

Applied to 50 real Gmail spam emails (ground truth = all spam), SVM maintains 92% accuracy, demonstrating robustness to distributional shift. Logistic Regression drops to 58% with unstable precision and recall, which suggests it overfit to the SpamAssassin format rather than learning general spam patterns.
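Because every message in this test set is spam, accuracy here is the same quantity as spam recall, and precision carries little information since there is no ham to mislabel. The sketch below illustrates this with a made-up prediction vector sized to match the reported 92%; it assumes all 50 messages remain after the short-email filter, and the vector itself is not project output.

```r
# Illustrative predictions only: 46 of 50 flagged as spam corresponds to 92%.
lvls  <- c("ham", "spam")
truth <- factor(rep("spam", 50), levels = lvls)
pred  <- factor(rep(c("spam", "ham"), times = c(46, 4)), levels = lvls)

true_pos <- sum(pred == "spam" & truth == "spam")

accuracy       <- mean(pred == truth)
spam_recall    <- true_pos / sum(truth == "spam")
spam_precision <- true_pos / sum(pred == "spam")  # always 1 when no ham exists

c(accuracy = accuracy, recall = spam_recall, precision = spam_precision)
```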

Holdout Test Set — Full Metrics Comparison
Model                 Accuracy  Precision  Recall  F1-Score  Real-World Acc.
Linear SVM            0.990     0.985      0.996   0.992     0.920
Logistic Regression   0.973     0.968      0.979   0.974     0.580
Cross-Validation Results (5-Fold) — Linear SVM
Metric      CV Mean  Std Err  Interpretation
Accuracy    ~0.992   <0.002   Extremely stable across folds
Precision   ~0.990   <0.003   Very low false positive rate
Recall      ~0.996   <0.002   Near-zero missed spam
F1-Score    ~0.993   <0.002   Balanced precision/recall
06 Contributions
Language: R · Quarto
Packages: tidyverse · tidymodels · textrecipes · glmnet · LiblineaR · stringr · stringi · tm · stopwords
Training Data: SpamAssassin corpus, spam_2 (1,391) + easy_ham (2,488) = 3,879 emails
Test Data: Personal Gmail MBOX export, 50 real-world spam emails
Features: TF-IDF · max_tokens=2000 · stopwords removed · step_tokenfilter
Models: Linear SVM (LiblineaR) · LASSO Logistic Regression (glmnet)
Validation: 80/20 stratified split · 5-fold CV · external real-world MBOX test
Course: DATA 607 · Data Acquisition and Management · CUNY SPS

Read the full analysis

Full Quarto report on RPubs · source code and datasets on GitHub.