R · NLP · Tidy Text · CUNY DATA 607

Sentiment Analysis
Literary Text & Real-World News

A two-part tidy text sentiment analysis: first reproducing the canonical workflow from Text Mining with R (Silge & Robinson) on Jane Austen's six novels, then extending it to ~400 real-world news articles collected via NewsAPI across four categories — politics, technology, business, and sports. Four sentiment lexicons are compared: Bing, AFINN, NRC, and Loughran–McDonald, revealing how corpus type and lexicon choice fundamentally shape sentiment results.

R · tidytext · janeaustenr · NewsAPI · Bing · AFINN · NRC · Loughran-McDonald · NLP · CUNY DATA 607
Language: R · Quarto
Corpus 1: Jane Austen novels · 6 books · janeaustenr
Corpus 2: NewsAPI · ~400 articles · 4 categories
Lexicons: Bing · AFINN · NRC · Loughran-McDonald
Key Finding: Literary text → smooth; News → volatile
Course: DATA 607 · CUNY SPS
01 About

This project explores word-level sentiment analysis using tidy text principles — where text is decomposed into individual word tokens and sentiment is estimated by joining those tokens against pre-built sentiment lexicons. The approach follows the method established in Text Mining with R: A Tidy Approach (Silge & Robinson, 2017), treating overall sentiment as an aggregate of word-level contributions.

The core question driving the extension is: how well does a sentiment methodology developed for literary prose transfer to real-world news text? By applying identical pipelines to both corpora, the analysis isolates the effect of corpus type and lexicon choice on sentiment results.

The project is structured in two parts. Part 1 establishes a baseline using Jane Austen's six novels as a clean, well-structured literary corpus. Part 2 applies the same methodology — and adds the Loughran–McDonald lexicon — to ~400 news articles collected from the NewsAPI across four topic categories.

A key methodological contribution is the use of custom domain stop words (e.g., removing the word "trump" from political news analysis) to reduce named-entity bias in sentiment scoring — a preprocessing concern that doesn't arise in literary text but is critical for news-domain analysis.

Part 1
Jane Austen Corpus

6 novels · Bing, AFINN, NRC lexicons · 80-line sentiment chunks · lexicon comparison on Pride & Prejudice

Part 2
NewsAPI Extension

~400 articles · 4 categories · 4 lexicons including Loughran-McDonald · custom stop word removal · per-article sentiment

02 Lexicons

Four sentiment lexicons are applied across the two corpora. Each encodes a different theory of sentiment: binary polarity (Bing), graded numeric scoring (AFINN), multi-dimensional emotion (NRC), and domain-adapted financial sentiment (Loughran–McDonald). Comparing lexicons on the same text reveals how the choice of sentiment dictionary shapes the conclusions drawn.

Lexicon 01
Bing
Hu & Liu (2004)
positive / negative

Binary classification of words as positive or negative. Applied in Part 1 across all 6 novels (80-line chunks) and in Part 2 per article. Produces sharp net sentiment (positive − negative) ideal for tracking narrative arcs. Used via get_sentiments("bing").

Lexicon 02
AFINN
Nielsen (2011)
−5 to +5 numeric score

Assigns integer scores from −5 (most negative) to +5 (most positive) to 2,477 words. Sentiment is computed as sum(value) per chunk. Shows greater amplitude than Bing and NRC because each word carries a graded integer weight rather than a single positive/negative label; the variation is most pronounced in political news.
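As a rough illustration of the sum(value) scoring, the sketch below uses a tiny hand-made stand-in for AFINN. The real lexicon comes from get_sentiments("afinn"), which needs a one-time download through the textdata package; the words, scores, and chunk indices here are invented for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for get_sentiments("afinn"): same columns (word, value),
# but only four invented entries so the example runs without any download.
toy_afinn <- tribble(
  ~word,      ~value,
  "good",      3,
  "superb",    5,
  "bad",      -3,
  "disaster", -2
)

# Hypothetical tokens, already assigned to chunks via an index column.
tokens <- tribble(
  ~index, ~word,
  0,      "good",
  0,      "superb",
  0,      "weather",   # not in the lexicon, so the join drops it
  1,      "bad",
  1,      "disaster",
  1,      "good"
)

# Sum the word-level scores within each chunk.
afinn_scores <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(index) %>%
  summarise(sentiment = sum(value), .groups = "drop")

afinn_scores
```

Because a single strongly scored word can move a chunk total by several points, totals computed this way tend to swing further than a count-based net score.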

Lexicon 03
NRC
Mohammad & Turney (2013)
8 emotions + positive / negative (10 categories)

Maps words to ten categories: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and two sentiments (positive and negative). In Part 1, joy words in Emma are extracted and counted. In Part 2, positive/negative subsets are used for lexicon comparison alongside Bing and AFINN.
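The joy-word count can be sketched as follows. The lexicon here is a small invented stand-in with the same columns as get_sentiments("nrc") (word, sentiment), since the real NRC lexicon needs a download through textdata; the tokens are also made up.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for get_sentiments("nrc"); a word can appear in
# several categories, as it can in the real lexicon.
toy_nrc <- tribble(
  ~word,    ~sentiment,
  "friend", "joy",
  "friend", "positive",
  "friend", "trust",
  "hope",   "joy",
  "hope",   "anticipation",
  "afraid", "fear"
)

# Keep only the joy category before joining.
toy_joy <- toy_nrc %>% filter(sentiment == "joy")

# Hypothetical tidy tokens for one book.
tokens <- tibble(
  book = "Emma",
  word = c("friend", "hope", "friend", "afraid", "letter")
)

joy_counts <- tokens %>%
  filter(book == "Emma") %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)

joy_counts
```

Filtering the lexicon to one category first avoids counting a token once for every category its word belongs to.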

Lexicon 04
Loughran–McDonald
Loughran & McDonald (2011)
Financial domain

Built from financial filings and intended for financial and formal text. It was designed so that terms such as "liability" or "risk", which general-purpose dictionaries tend to treat as negative, are not automatically scored as negative in a financial context. Applied only in Part 2 (news corpus), where its formal-register focus is the closest fit. Consistently produces more negative sentiment scores than the other three lexicons on political news.
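The tidytext version of this lexicon carries more categories than positive and negative (for example uncertainty and litigious), so a net score needs a filter before counting. The sketch below shows one way to do that on an invented stand-in lexicon; the entries are illustrative and not copied from the real word lists, and whether the project filters in exactly this way is an assumption.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in with the same columns as get_sentiments("loughran").
toy_loughran <- tribble(
  ~word,        ~sentiment,
  "loss",       "negative",
  "decline",    "negative",
  "gain",       "positive",
  "may",        "uncertainty",
  "litigation", "litigious"
)

# Hypothetical tokens from two articles.
tokens <- tribble(
  ~article_id, ~word,
  1,           "loss",
  1,           "decline",
  1,           "gain",
  1,           "may",
  2,           "gain",
  2,           "litigation"
)

lm_scores <- tokens %>%
  inner_join(
    toy_loughran %>% filter(sentiment %in% c("positive", "negative")),
    by = "word"
  ) %>%
  count(article_id, sentiment) %>%
  # values_fill = 0 keeps articles that lack one polarity from producing NA
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

lm_scores
```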

03 Pipeline

The tidy text workflow treats every word as an observation (one row per token) and uses standard dplyr and tidyr verbs, such as inner_join(), count(), group_by(), and pivot_wider(), to compute sentiment at any level of aggregation. The same steps apply to both corpora; the NewsAPI pipeline adds a news collection step and domain-specific stop word cleaning.

01
Load and structure text corpus
Part 1: austen_books() loads 6 novels. Line numbers and chapter markers are added via mutate(linenumber=row_number(), chapter=cumsum(str_detect(...))). Part 2: NewsAPI /v2/everything endpoint queried for 4 categories (politics, technology, business, sports), 100 articles each. Title + description + content concatenated into a single text field.
GET /v2/everything?q={query}&pageSize=100&language=en
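A minimal sketch of both structuring steps on toy data. Two things are assumptions rather than details from the source: the chapter pattern (the source elides the argument to str_detect(); the one below follows the pattern used in Text Mining with R) and the handling of missing fields when concatenating article text.

```r
library(dplyr)
library(stringr)
library(tibble)

# Part 1 style: add line numbers and a running chapter counter per book.
toy_book <- tibble(
  book = "Toy Novel",
  text = c("CHAPTER 1", "It was a fine morning.", "",
           "CHAPTER 2", "The rain came.")
)

structured <- toy_book %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    # Assumed pattern: a line starting with "chapter" plus a digit or roman numeral.
    chapter = cumsum(str_detect(
      text, regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) %>%
  ungroup()

# Part 2 style: combine title, description, and content into one text field.
# The articles are invented; a real response would come from the API.
toy_articles <- tibble(
  article_id  = 1:2,
  title       = c("Markets rally", "Team wins final"),
  description = c("Stocks rise", NA),
  content     = c("Shares gained on Monday.", "A late goal decided it.")
)

articles_text <- toy_articles %>%
  mutate(text = str_squish(paste(
    coalesce(title, ""), coalesce(description, ""), coalesce(content, "")
  )))

structured
articles_text
```

Replacing missing fields with empty strings before pasting keeps the literal text "NA" out of the combined field, where it would otherwise be tokenized as a word.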
02
Tokenize into tidy word format
unnest_tokens(word, text) converts each text into one word per row, which is the core tidy text operation. This produces a long-format tibble with one row per word occurrence in each document (a word that appears twice gets two rows), enabling all subsequent joins and aggregations.
tidy_books <- austen_books() %>% unnest_tokens(word, text)
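To make the reshaping concrete, here is the same call on two invented sentences. The doc_id column is a hypothetical identifier; any other columns in the input are carried along to each token row.

```r
library(dplyr)
library(tibble)
library(tidytext)

toy_text <- tibble(
  doc_id = c(1, 2),
  text   = c("It is a truth universally acknowledged.",
             "A truth, they said!")
)

# One row per word occurrence; by default unnest_tokens() lowercases
# the tokens and strips punctuation.
tidy_toy <- toy_text %>%
  unnest_tokens(word, text)

tidy_toy
```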
03
Remove stop words (and domain-specific terms)
anti_join(stop_words) removes common English stop words. For the news corpus, a custom stop word list is added using bind_rows(tibble(word=c("trump"), lexicon="custom"), stop_words) to prevent named entities from skewing political sentiment scores.
anti_join(custom_stop_words, by="word")
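The custom list can be exercised on a handful of invented tokens, as sketched below. A surname that doubles as an ordinary English word can be matched by a general-purpose lexicon, which is why it is removed before the sentiment join.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Extend the built-in stop word list with a domain-specific entry.
custom_stop_words <- bind_rows(
  tibble(word = c("trump"), lexicon = "custom"),
  stop_words
)

# Hypothetical tokens from one political article.
tokens <- tibble(
  article_id = 1,
  word = c("the", "trump", "administration", "crisis")
)

cleaned <- tokens %>%
  anti_join(custom_stop_words, by = "word")

cleaned
```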
04
Join with sentiment lexicons
inner_join(get_sentiments(lexicon), by = "word"), where lexicon is one of "bing", "afinn", "nrc", or "loughran", matches each token against the chosen lexicon and keeps only words that appear in both. Each lexicon is joined separately; chaining two inner joins would keep only the words present in both lexicons. Words not in the lexicon are dropped, which is a key limitation of dictionary-based approaches on domain-specific text.
inner_join(get_sentiments("bing"), by = "word")
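A small sketch of a single-lexicon join using Bing, which ships with tidytext (AFINN, NRC, and Loughran-McDonald need a one-time download through the textdata package). The tokens are invented for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tokens from two articles.
tokens <- tibble(
  article_id = c(1, 1, 1, 2, 2),
  word       = c("crisis", "win", "parliament", "happy", "fail")
)

# Join against one lexicon at a time; unmatched words are dropped.
bing_matches <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word")

bing_matches
```

To compare lexicons, the same tokens would be joined to each lexicon in its own pipeline and the results combined afterwards, rather than piped through one join after another.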
05
Aggregate sentiment by chunk / article
Part 1: texts chunked into 80-line sections using index = linenumber %/% 80. Part 2: aggregated by article_id. In both cases, pivot_wider() reshapes positive/negative counts into columns, and sentiment = positive − negative computes the net score.
mutate(sentiment = positive - negative)
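The chunking and net-score steps can be sketched on a few already-scored tokens. The line numbers and labels are invented; integer division by 80 assigns lines 0 to 79 to chunk 0, lines 80 to 159 to chunk 1, and so on.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical tokens that have already been matched to a lexicon.
scored <- tribble(
  ~book,  ~linenumber, ~sentiment,
  "Emma",   3,         "positive",
  "Emma",  42,         "positive",
  "Emma",  79,         "negative",
  "Emma",  80,         "negative",
  "Emma", 120,         "negative",
  "Emma", 159,         "positive"
)

chunk_sentiment <- scored %>%
  # index is the 80-line chunk each token falls into
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

chunk_sentiment
```

For the per-article case, the same pattern applies with article_id in place of the index expression.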
06
Visualize and compare across lexicons
Sentiment is plotted using ggplot2::geom_col() with facet_wrap(~method) to compare AFINN, Bing, NRC, and Loughran–McDonald side by side on the same political news articles. Top contributing words are extracted via count(word, sentiment, sort=TRUE) followed by slice_max(n, n=10).
facet_wrap(~method, ncol=1, scales="free_y")
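A minimal sketch of the faceted comparison. The per-article scores below are invented placeholders, not results from the analysis; the only structural assumption is that each lexicon's scores share the columns article_id and sentiment and are tagged with a method column before being stacked.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Invented per-article scores, one tibble per lexicon.
afinn    <- tibble(article_id = 1:4, sentiment = c(-6, 4, -9, 2),  method = "AFINN")
bing     <- tibble(article_id = 1:4, sentiment = c(-2, 1, -3, 1),  method = "Bing")
nrc      <- tibble(article_id = 1:4, sentiment = c(-1, 2, -2, 1),  method = "NRC")
loughran <- tibble(article_id = 1:4, sentiment = c(-3, 0, -4, -1), method = "Loughran-McDonald")

comparison <- bind_rows(afinn, bing, nrc, loughran)

# One panel per lexicon; free y scales let each lexicon use its own range.
p <- ggplot(comparison, aes(article_id, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

p
```

Free y scales make the shape of each series easy to compare, at the cost of hiding differences in absolute magnitude between lexicons.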
04 Key Findings
Literary vs News Corpus
Jane Austen Novels — Literary Corpus
Smooth & Consistent Sentiment

All six novels produce coherent positive-leaning sentiment arcs across 80-line chunks, with clear narrative highs and lows. Structured narrative prose and stable vocabulary make lexicon matching reliable and results interpretable across all three lexicons (Bing, AFINN, NRC).

NewsAPI Articles — Real-World Corpus
Volatile & Inconsistent Sentiment

News articles produce highly variable sentiment with large swings between articles, even within the same category. Mixed tonal register (factual reporting + opinion), domain-specific vocabulary, and named entities (political figures) all reduce lexicon match rate and reliability compared to literary text.
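One way to quantify lexicon match rate is the share of tokens that find a match in the lexicon, computed per corpus. The sketch below is an assumption about how this could be measured, since the source does not say how match rate was assessed; the lexicon and tokens are invented.

```r
library(dplyr)
library(tibble)

# Hypothetical mini lexicon.
toy_lexicon <- tibble(
  word      = c("happy", "sad", "win", "crisis"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Hypothetical tokens from two corpora, after stop word removal.
tokens <- tribble(
  ~corpus,    ~word,
  "literary", "happy",
  "literary", "sad",
  "literary", "walk",
  "literary", "letter",
  "news",     "crisis",
  "news",     "senate",
  "news",     "tariff",
  "news",     "quarterly",
  "news",     "ceo"
)

match_rate <- tokens %>%
  mutate(is_match = word %in% toy_lexicon$word) %>%
  group_by(corpus) %>%
  summarise(
    n_tokens  = n(),
    n_matched = sum(is_match),
    rate      = n_matched / n_tokens,
    .groups   = "drop"
  )

match_rate
```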

Lexicon Comparison: Political News (after custom stop word removal)
Lexicon | Scale | Behaviour on Political News | Key Observation
AFINN | −5 to +5 | Highest amplitude swings | Graded integer scores amplify both positives and negatives
Bing | Pos / Neg | Sharp but moderate swings | Clean binary signal; most widely applicable
NRC | 8 emotions + Pos / Neg | Similar pattern to Bing | Using positive/negative subsets mirrors Bing behaviour
Loughran-McDonald | Pos / Neg | Consistently more negative | Finance-oriented word list with far more negative than positive entries, which likely tilts net scores negative
Category-Level Sentiment (Bing, per article)
Politics
Most Negative Category

Political news consistently skews negative across all four lexicons. Top negative words include: crisis, threat, fail, attack. Named entity bias (political figures) required custom stop word removal to reduce distortion.
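Top contributing words of the kind listed here can be extracted with count() and slice_max(). The sketch below runs on invented matched tokens and keeps the top two words per polarity so the output stays small; the word frequencies are not results from the analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens already matched to a positive/negative lexicon.
scored <- tibble(
  word = c("crisis", "crisis", "crisis", "threat", "threat", "attack",
           "win", "win", "support"),
  sentiment = c(rep("negative", 6), rep("positive", 3))
)

top_words <- scored %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  # with_ties = FALSE returns exactly n rows per group even when counts tie
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

top_words
```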

Sports
Most Positive Category

Sports articles produce the strongest positive sentiment signal. Words like win, champion, lead, achievement dominate. Results are consistent across Bing, AFINN, and NRC, suggesting sports vocabulary maps well onto general-purpose lexicons.

Technology
Moderately Positive

Technology articles show mild positive sentiment, with innovation-related words boosting scores. However, content about security, risk, and regulation pulls sentiment toward neutral or slightly negative in some articles.

Business
Mixed / Slightly Negative

Business articles show the greatest disagreement between lexicons — particularly between Loughran-McDonald (strongly negative due to financial terminology) and Bing/AFINN (closer to neutral), illustrating how domain matters for lexicon selection.

05 Contributions
Language: R · Quarto
Packages: tidytext · janeaustenr · dplyr · tidyr · stringr · ggplot2 · httr · jsonlite · purrr · tibble
Corpus 1: Jane Austen, 6 novels via janeaustenr package
Corpus 2: NewsAPI /v2/everything · politics, technology, business, sports · ~100 articles/category
Lexicons: Bing (Hu & Liu 2004) · AFINN (Nielsen 2011) · NRC (Mohammad & Turney 2013) · Loughran-McDonald (2011)
Chunking: 80-line sections (literary) · per-article aggregation (news)
Stop words: tidytext::stop_words + custom domain stop words (named entities)
Course: DATA 607 · Data Acquisition and Management · CUNY SPS

Read the full analysis

Full Quarto report on RPubs · source code and data on GitHub.