R · NLP · Tidy Text · CUNY DATA 607

Sentiment Analysis —
Literary Text & Real-World News

A two-part tidy text sentiment analysis: first reproducing the canonical workflow from Text Mining with R (Silge & Robinson) on Jane Austen's six novels, then extending it to ~400 real-world news articles collected via NewsAPI across four categories — politics, technology, business, and sports. Four sentiment lexicons are compared: Bing, AFINN, NRC, and Loughran–McDonald, revealing how corpus type and lexicon choice fundamentally shape sentiment results.

R · tidytext · janeaustenr · NewsAPI · Bing · AFINN · NRC · Loughran-McDonald · NLP · CUNY DATA 607

Language	R · Quarto
Corpus 1	Jane Austen novels · 6 books · janeaustenr
Corpus 2	NewsAPI · ~400 articles · 4 categories
Lexicons	Bing · AFINN · NRC · Loughran-McDonald
Key Finding	Literary text → smooth; News → volatile
Course	DATA 607 · CUNY SPS
Links	GitHub Repo ↗ RPubs Report ↗

01 About

This project explores word-level sentiment analysis using tidy text principles — where text is decomposed into individual word tokens and sentiment is estimated by joining those tokens against pre-built sentiment lexicons. The approach follows the method established in Text Mining with R: A Tidy Approach (Silge & Robinson, 2017), treating overall sentiment as an aggregate of word-level contributions.

The core question driving the extension is: how well does a sentiment methodology developed for literary prose transfer to real-world news text? By applying identical pipelines to both corpora, the analysis isolates the effect of corpus type and lexicon choice on sentiment results.

The project is structured in two parts. Part 1 establishes a baseline using Jane Austen's six novels as a clean, well-structured literary corpus. Part 2 applies the same methodology — and adds the Loughran–McDonald lexicon — to ~400 news articles collected from the NewsAPI across four topic categories.

A key methodological contribution is the use of custom domain stop words (e.g., removing the word "trump" from political news analysis) to reduce named-entity bias in sentiment scoring — a preprocessing concern that doesn't arise in literary text but is critical for news-domain analysis.

Part 1

Jane Austen Corpus

6 novels · Bing, AFINN, NRC lexicons · 80-line sentiment chunks · lexicon comparison on Pride & Prejudice

Part 2

NewsAPI Extension

~400 articles · 4 categories · 4 lexicons including Loughran-McDonald · custom stop word removal · per-article sentiment

02 Lexicons

Four sentiment lexicons are applied across the two corpora. Each encodes a different theory of sentiment: binary polarity (Bing), graded numeric scoring (AFINN), multi-dimensional emotion (NRC), and finance-specific word lists (Loughran–McDonald). Comparing lexicons on the same text reveals how the choice of sentiment dictionary shapes the conclusions drawn.

Lexicon 01

Bing

Hu & Liu (2004)

positive / negative

Binary classification of words as positive or negative. Applied in Part 1 across all 6 novels (80-line chunks) and in Part 2 per article. Produces sharp net sentiment (positive − negative) ideal for tracking narrative arcs. Used via get_sentiments("bing").

Lexicon 02

AFINN

Nielsen (2011)

−5 to +5 numeric score

Assigns integer scores from −5 (most negative) to +5 (most positive) to 2,477 words. Sentiment is computed as sum(value) per chunk. Shows greater amplitude than Bing and NRC because its graded integer scores weight strongly charged words more heavily than a simple positive/negative count; the variation is most pronounced in political news.
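To make the amplitude difference concrete, here is a minimal sketch that scores the same two chunks twice: once as a graded sum and once as a plain positive-minus-negative count. The lexicon table toy_afinn is a hypothetical stand-in with illustrative scores, not values copied from the real AFINN file.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for an AFINN-style lexicon (illustrative scores only).
toy_afinn <- tibble(
  word  = c("good", "outstanding", "bad", "catastrophic"),
  value = c(3, 5, -3, -4)
)

# Tokens already assigned to chunks via an index column.
tokens <- tibble(
  index = c(0, 0, 0, 1, 1),
  word  = c("good", "outstanding", "bad", "catastrophic", "good")
)

chunk_scores <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(index) %>%
  summarise(
    afinn_sum = sum(value),        # graded score: magnitudes add up
    net_count = sum(sign(value)),  # binary-style count: each word is +1 or -1
    .groups = "drop"
  )

print(chunk_scores)
```

In this toy case the first chunk scores 5 as a graded sum but only 1 as a net count, which is the kind of gap that shows up as larger swings in the AFINN panel.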

Lexicon 03

NRC

Mohammad & Turney (2013)

8 emotions + 2 sentiments

Maps words to 10 categories: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) plus two sentiments (positive and negative). A word can belong to several categories at once. In Part 1, joy words in Emma are extracted and counted. In Part 2, the positive/negative subsets are used for lexicon comparison alongside Bing and AFINN.
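The two uses of NRC described above can be sketched as follows. Because the real NRC lexicon has to be downloaded separately, toy_nrc is a small hand-made stand-in with one row per (word, category) pair; its entries are assumptions for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the NRC lexicon: one row per (word, category) pair.
toy_nrc <- tibble(
  word      = c("happy", "happy", "happy", "friend", "friend", "dread", "dread"),
  sentiment = c("joy", "positive", "trust", "joy", "positive", "fear", "negative")
)

tokens <- tibble(word = c("happy", "friend", "happy", "dread", "letter"))

# Part 1 style: keep only the joy category, then count words.
joy_counts <- tokens %>%
  inner_join(filter(toy_nrc, sentiment == "joy"), by = "word") %>%
  count(word, sort = TRUE)

# Part 2 style: keep only the positive/negative subset for polarity scoring.
polarity <- tokens %>%
  inner_join(filter(toy_nrc, sentiment %in% c("positive", "negative")), by = "word") %>%
  count(sentiment)

print(joy_counts)
print(polarity)
```

Filtering the lexicon before the join avoids counting the same token once per category it belongs to.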

Lexicon 04

Loughran–McDonald

Loughran & McDonald (2011)

Financial domain

Built from financial reports and intended for financial and formal text. It deliberately does not treat words such as "liability" or "risk" as negative, because they are often neutral in that register even though general-purpose lexicons may score them negatively. Applied only in Part 2 (news corpus), where its domain focus is appropriate. On political news it consistently produces more negative sentiment scores than the other three lexicons.

03 Pipeline

The tidy text workflow treats every word as an observation (one row per token) and uses standard dplyr verbs — inner_join(), count(), group_by(), pivot_wider() — to compute sentiment at any level of aggregation. The same steps apply to both corpora; the NewsAPI pipeline adds a news collection step and domain-specific stop word cleaning.

Load and structure text corpus

Part 1: austen_books() loads 6 novels. Line numbers and chapter markers are added via mutate(linenumber=row_number(), chapter=cumsum(str_detect(...))). Part 2: NewsAPI /v2/everything endpoint queried for 4 categories (politics, technology, business, sports), 100 articles each. Title + description + content concatenated into a single text field.

GET /v2/everything?q={query}&pageSize=100&language=en
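A minimal sketch of the collection step, assuming the API key is passed as an apiKey query parameter. The helpers build_news_url() and articles_to_text() are hypothetical names introduced here, and sample_body is a made-up response used so the parsing can run without a network call.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: builds the request URL for one query.
build_news_url <- function(query, api_key, page_size = 100) {
  modify_url(
    "https://newsapi.org/v2/everything",
    query = list(q = query, pageSize = page_size, language = "en", apiKey = api_key)
  )
}

# Hypothetical helper: turns a JSON response body into one text field per article.
articles_to_text <- function(json_body) {
  articles <- fromJSON(json_body)$articles
  data.frame(
    article_id = seq_len(nrow(articles)),
    text = paste(articles$title, articles$description, articles$content),
    stringsAsFactors = FALSE
  )
}

# Made-up response body, for illustration only.
sample_body <- '{"status":"ok","articles":[
  {"title":"Markets rally","description":"Stocks gain","content":"Shares rose."},
  {"title":"Storm warning","description":"Heavy rain","content":"Floods feared."}]}'

news <- articles_to_text(sample_body)
url  <- build_news_url("politics", api_key = "YOUR_KEY")

# A live request would then look like:
# resp <- GET(url)
# news <- articles_to_text(content(resp, as = "text", encoding = "UTF-8"))
print(news)
```

Missing fields come back as NA and would be pasted in as the string "NA", so a real pipeline would likely replace them with empty strings before concatenating.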

Tokenize into tidy word format

unnest_tokens(word, text) converts each text into one word per row — the core tidy text operation. This produces a long-format tibble with one observation per (document, word) pair, enabling all subsequent joins and aggregations.

tidy_books <- austen_books() %>% unnest_tokens(word, text)
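The same operation on a tiny made-up corpus shows the shape of the result. The raw_lines tibble is an assumption standing in for austen_books(); the line number is added before tokenizing so that each word keeps track of the line it came from.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two made-up lines standing in for a corpus (one row per line of text).
raw_lines <- tibble(
  book = "Toy Novel",
  text = c("It is a truth universally acknowledged,", "that Emma was happy.")
)

tidy_lines <- raw_lines %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%  # keep the source line for later chunking
  ungroup() %>%
  unnest_tokens(word, text)              # lowercases and strips punctuation by default

print(tidy_lines)
```

Each output row is one (book, linenumber, word) observation, which is what the later joins and chunk counts operate on.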

Remove stop words (and domain-specific terms)

anti_join(stop_words) removes common English stop words. For the news corpus, a custom stop word list is added using bind_rows(tibble(word=c("trump"), lexicon="custom"), stop_words) to prevent named entities from skewing political sentiment scores.

anti_join(custom_stop_words, by="word")
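Put together, the stop word step might look like the sketch below. The tokens tibble is made up for illustration; stop_words is the list shipped with tidytext. The motivation is that a general-purpose lexicon may score the ordinary word "trump" as sentiment-bearing, which would distort scores in political coverage.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Standard stop words plus one domain-specific entry.
custom_stop_words <- bind_rows(
  tibble(word = c("trump"), lexicon = "custom"),
  stop_words
)

# Made-up tokens from a single article.
tokens <- tibble(
  article_id = 1,
  word = c("trump", "the", "crisis", "deepens")
)

# anti_join keeps only the rows whose word is NOT in the stop word list.
cleaned <- tokens %>%
  anti_join(custom_stop_words, by = "word")

print(cleaned)
```

Further named entities can be added by extending the character vector passed to tibble().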

Join with sentiment lexicons

inner_join(get_sentiments(lexicon), by="word"), where lexicon is one of "bing", "afinn", "nrc", or "loughran", matches each token against the chosen lexicon and keeps only words that appear in both. Words not in the lexicon are dropped, which is a key limitation of dictionary-based approaches on domain-specific text. Each lexicon is joined separately; chaining two inner joins would keep only the words present in both lexicons.

inner_join(get_sentiments("bing"), by="word")
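The dropped-word limitation can be made visible by computing a match rate. This sketch uses the Bing lexicon because it ships with tidytext; the tokens are made up, and match_rate is a name introduced here for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up tokens: two sentiment-bearing words (one repeated) and one neutral noun.
tokens <- tibble(word = c("good", "crisis", "parliament", "good"))

# Only tokens that appear in the lexicon survive the join.
matched <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word")

# Share of tokens that received any sentiment label.
match_rate <- nrow(matched) / nrow(tokens)

print(matched)
print(match_rate)
```

Comparing this rate between corpora is one simple way to check how much of the text a lexicon actually covers.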

Aggregate sentiment by chunk / article

Part 1: texts chunked into 80-line sections using index = linenumber %/% 80. Part 2: aggregated by article_id. In both cases, pivot_wider() reshapes positive/negative counts into columns, and sentiment = positive − negative computes the net score.

mutate(sentiment = positive - negative)
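A small worked example of the chunking arithmetic follows, using made-up matched tokens. The values_fill = 0 argument is an addition assumed here so that a chunk with no positive (or no negative) words gets a count of zero instead of NA.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up tokens that already carry a Bing-style label and a source line number.
matched <- tibble(
  linenumber = c(3, 10, 45, 85, 90, 150, 170),
  sentiment  = c("positive", "positive", "negative",
                 "negative", "negative", "positive", "negative")
)

chunked <- matched %>%
  count(index = linenumber %/% 80, sentiment) %>%  # integer division: lines 0-79 -> 0, 80-159 -> 1, ...
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)           # net score per chunk

print(chunked)
```

For the news corpus the same code applies with article_id in place of the computed index.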

Visualize and compare across lexicons

Sentiment is plotted using ggplot2::geom_col() with facet_wrap(~method) to compare AFINN, Bing, NRC, and Loughran–McDonald side by side on the same political news articles. Top contributing words are extracted via count(word, sentiment, sort=TRUE) %>% slice_max(n, n=10).

facet_wrap(~method, ncol=1, scales="free_y")
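One way to get per-lexicon scores into a single faceted plot is to stack them with a method column first. The scores below are made up, and only two lexicons are shown to keep the sketch short.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Made-up per-article net scores, one tibble per lexicon.
bing_scores  <- tibble(article_id = 1:3, sentiment = c(2, -1, 0))
afinn_scores <- tibble(article_id = 1:3, sentiment = c(6, -5, 1))

# Stack the tibbles and label each row with the lexicon that produced it.
comparison <- bind_rows(
  mutate(bing_scores,  method = "Bing"),
  mutate(afinn_scores, method = "AFINN")
)

# One panel per lexicon; free y scales because the lexicons score on different ranges.
p <- ggplot(comparison, aes(article_id, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

print(p)
```

With scales = "free_y" the shapes of the series are easy to compare, but the bar heights are not comparable across panels.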

04 Key Findings

Literary vs News Corpus

Jane Austen Novels — Literary Corpus

Smooth & Consistent Sentiment

All six novels produce coherent positive-leaning sentiment arcs across 80-line chunks, with clear narrative highs and lows. Structured narrative prose and stable vocabulary make lexicon matching reliable and results interpretable across all three lexicons (Bing, AFINN, NRC).

NewsAPI Articles — Real-World Corpus

Volatile & Inconsistent Sentiment

News articles produce highly variable sentiment with large swings between articles, even within the same category. Mixed tonal register (factual reporting + opinion), domain-specific vocabulary, and named entities (political figures) all reduce lexicon match rate and reliability compared to literary text.

Lexicon Comparison — Political News (after custom stop word removal)

Lexicon	Scale	Behaviour on Political News	Key Observation
AFINN	−5 to +5	Highest amplitude swings	Graded scores amplify both positives and negatives
Bing	Pos / Neg	Sharp but moderate swings	Clean binary signal; most widely applicable
NRC	8 emotions + 2 sentiments	Similar pattern to Bing	Using positive/negative subsets mirrors Bing behaviour
Loughran-McDonald	Pos / Neg	Consistently more negative	Domain-specific negative vocabulary inflates negativity

Category-Level Sentiment (Bing, per article)

Politics

Most Negative Category

Political news consistently skews negative across all four lexicons. Top negative words include: crisis, threat, fail, attack. Named entity bias (political figures) required custom stop word removal to reduce distortion.

Sports

Most Positive Category

Sports articles produce the strongest positive sentiment signal. Words like win, champion, lead, achievement dominate. Results are consistent across Bing, AFINN, and NRC, suggesting sports vocabulary maps well onto general-purpose lexicons.

Technology

Moderately Positive

Technology articles show mild positive sentiment, with innovation-related words boosting scores. However, content about security, risk, and regulation pulls sentiment toward neutral or slightly negative in some articles.

Business

Mixed / Slightly Negative

Business articles show the greatest disagreement between lexicons — particularly between Loughran-McDonald (strongly negative due to financial terminology) and Bing/AFINN (closer to neutral), illustrating how domain matters for lexicon selection.

05 Contributions

01 Reproduced canonical tidy text workflow — Implemented the full Chapter 2 pipeline from Text Mining with R: tokenization via unnest_tokens(), stop word removal, lexicon joins, 80-line chunking, and sentiment visualization across all 6 Jane Austen novels and three lexicons.
02 NewsAPI integration and corpus construction — Built a reusable get_news(query) function using httr and jsonlite to retrieve 100 full articles per category from /v2/everything. Combined title, description, and content into a single unified text field for richer sentiment signals than headline-only approaches.
03 Four-lexicon comparative analysis — Applied Bing, AFINN, NRC, and Loughran–McDonald lexicons to the same political news text, revealing systematic divergence: Loughran–McDonald consistently skews more negative due to domain-specific financial vocabulary — a bias absent in the other three.
04 Domain-specific stop word removal — Extended the standard stop_words list with a custom tibble (tibble(word=c("trump"), lexicon="custom")) to remove political named entities that distort sentiment scoring — demonstrating the importance of corpus-aware preprocessing in NLP pipelines.
05 Cross-corpus comparative insight — Quantified and visualized the fundamental difference between literary and news sentiment: structured narrative prose produces smooth, interpretable sentiment arcs while real-world news yields volatile, lexicon-sensitive results that require careful methodological choices.

Language	R · Quarto
Packages	tidytext · janeaustenr · dplyr · tidyr · stringr · ggplot2 · httr · jsonlite · purrr · tibble
Corpus 1	Jane Austen — 6 novels via janeaustenr package
Corpus 2	NewsAPI /v2/everything · politics, technology, business, sports · ~100 articles/category
Lexicons	Bing (Hu & Liu 2004) · AFINN (Nielsen 2011) · NRC (Mohammad & Turney 2013) · Loughran-McDonald (2011)
Chunking	80-line sections (literary) · per-article aggregation (news)
Stop words	tidytext::stop_words + custom domain stop words (named entities)
Course	DATA 607 · Data Acquisition and Management · CUNY SPS

Read the full analysis

Full Quarto report on RPubs · source code and data on GitHub.
