A two-part tidy text sentiment analysis: first reproducing the canonical workflow from Text Mining with R (Silge & Robinson) on Jane Austen's six novels, then extending it to ~400 real-world news articles collected via NewsAPI across four categories — politics, technology, business, and sports. Four sentiment lexicons are compared: Bing, AFINN, NRC, and Loughran–McDonald, revealing how corpus type and lexicon choice fundamentally shape sentiment results.
This project explores word-level sentiment analysis using tidy text principles — where text is decomposed into individual word tokens and sentiment is estimated by joining those tokens against pre-built sentiment lexicons. The approach follows the method established in Text Mining with R: A Tidy Approach (Silge & Robinson, 2017), treating overall sentiment as an aggregate of word-level contributions.
The core question driving the extension is: how well does a sentiment methodology developed for literary prose transfer to real-world news text? By applying identical pipelines to both corpora, the analysis isolates the effect of corpus type and lexicon choice on sentiment results.
The project is structured in two parts. Part 1 establishes a baseline using Jane Austen's six novels as a clean, well-structured literary corpus. Part 2 applies the same methodology — and adds the Loughran–McDonald lexicon — to ~400 news articles collected from the NewsAPI across four topic categories.
A key methodological contribution is the use of custom domain stop words (e.g., removing the word "trump" from political news analysis) to reduce named-entity bias in sentiment scoring — a preprocessing concern that doesn't arise in literary text but is critical for news-domain analysis.
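A minimal sketch of this preprocessing step, using two made-up headlines rather than project data. The `articles` and `custom_stop_words` names and the sample text are illustrative assumptions; `stop_words`, `unnest_tokens()`, and `anti_join()` come from tidytext and dplyr.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy headlines (illustrative only, not from the project corpus)
articles <- tibble(
  article_id = 1:2,
  text = c("Trump wins praise for trade deal",
           "Critics attack Trump over budget crisis")
)

# Prepend the custom entry to tidytext's built-in stop word list
custom_stop_words <- bind_rows(
  tibble(word = c("trump"), lexicon = "custom"),
  stop_words
)

tidy_news <- articles %>%
  unnest_tokens(word, text) %>%               # lowercases and splits into one word per row
  anti_join(custom_stop_words, by = "word")   # drops stop words, including "trump"
```

Because `unnest_tokens()` lowercases tokens by default, the lowercase entry "trump" also catches the capitalised surname in the raw text. Further named entities could be added to the same `c(...)` vector.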
Part 1: 6 novels · Bing, AFINN, NRC lexicons · 80-line sentiment chunks · lexicon comparison on Pride & Prejudice
Part 2: ~400 articles · 4 categories · 4 lexicons including Loughran–McDonald · custom stop word removal · per-article sentiment
Four sentiment lexicons are applied across the two corpora. Each encodes a different theory of sentiment: binary polarity (Bing), numeric scoring (AFINN), multi-dimensional emotion (NRC), and domain-adapted financial/news negativity (Loughran–McDonald). Comparing lexicons on the same text reveals how the choice of sentiment dictionary shapes the conclusions drawn.
Bing classifies each word as positive or negative.
It is applied in Part 1 across all 6 novels (80-line chunks) and in Part 2
per article. The resulting net sentiment (positive − negative) gives a sharp
signal well suited to tracking narrative arcs. Loaded via get_sentiments("bing").
AFINN assigns integer scores from −5 (most negative) to +5
(most positive) to 2,477 words. Sentiment is computed as
sum(value) per chunk. It shows greater amplitude than Bing
and NRC because each word contributes a graded score rather
than a single positive or negative count; the variation is
most pronounced in political news.
NRC maps words to 10 categories: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and two sentiments (positive and negative). In Part 1, joy words in Emma are extracted and counted. In Part 2, the positive/negative subsets are used for lexicon comparison alongside Bing and AFINN.
Loughran–McDonald was developed from financial reports for financial and formal text, where words like "liability" or "risk" do not necessarily carry the negative connotations they have in everyday usage. It is applied only in Part 2 (news corpus), where its domain focus is appropriate. It consistently produces more negative sentiment scores than the other three lexicons on political news.
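To make the difference between binary and graded scoring concrete, here is a small sketch with hand-made lexicons. The words and scores in `toy_binary` and `toy_graded` are invented for illustration and are not taken from Bing or AFINN; the real lexicons would be loaded with `get_sentiments()`.

```r
library(dplyr)
library(tibble)
library(tidyr)

# One row per token, as produced by unnest_tokens()
tokens <- tibble(
  chunk = c(1, 1, 1, 2, 2),
  word  = c("superb", "good", "bad", "terrible", "fine")
)

# Toy binary lexicon (Bing-style): one label per word
toy_binary <- tibble(
  word      = c("superb", "good", "bad", "terrible", "fine"),
  sentiment = c("positive", "positive", "negative", "negative", "positive")
)

# Toy graded lexicon (AFINN-style): one integer score per word
toy_graded <- tibble(
  word  = c("superb", "good", "bad", "terrible", "fine"),
  value = c(5, 3, -3, -3, 2)
)

# Binary scheme: count labels per chunk, then net = positive - negative
binary_net <- tokens %>%
  inner_join(toy_binary, by = "word") %>%
  count(chunk, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

# Graded scheme: sum the word scores per chunk
graded_sum <- tokens %>%
  inner_join(toy_graded, by = "word") %>%
  group_by(chunk) %>%
  summarise(sentiment = sum(value), .groups = "drop")
```

With these toy inputs the binary net scores are 1 and 0, while the graded sums are 5 and −1: the graded scheme responds to word intensity, which is one reason its swings can be larger.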
The tidy text workflow treats every word as an observation (one row per token)
and uses standard dplyr and tidyr verbs (inner_join(), count(), group_by(),
and tidyr's pivot_wider()) to compute sentiment at any level of aggregation.
The same steps apply to both corpora; the NewsAPI pipeline adds a news
collection step and domain-specific stop word cleaning.
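The literary half of this workflow can be sketched as follows, closely following the published Text Mining with R example. This is an approximation rather than the project's exact code: it uses only the Bing lexicon (bundled with tidytext), omits the chapter-marker column because it is not needed for chunked scoring, and the object names `tidy_books` and `austen_sentiment` are assumptions.

```r
library(janeaustenr)
library(dplyr)
library(tidyr)
library(tidytext)

# One row per word, keeping the original line number within each book
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Net Bing sentiment per 80-line chunk of each novel
austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # keep only lexicon words
  count(book, index = linenumber %/% 80, sentiment) %>% # integer division gives the chunk id
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
```

For the news corpus, the same chain would group by an article identifier instead of `index`. Recent dplyr versions may warn about a many-to-many join here, since a few words appear in the lexicon with more than one label.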
Part 1: austen_books() loads 6 novels. Line numbers and chapter markers are added via mutate(linenumber=row_number(), chapter=cumsum(str_detect(...))). Part 2: NewsAPI /v2/everything endpoint queried for 4 categories (politics, technology, business, sports), 100 articles each. Title + description + content concatenated into a single text field.

unnest_tokens(word, text) converts each text into one word per row, the core tidy text operation. This produces a long-format tibble with one observation per (document, word) pair, enabling all subsequent joins and aggregations.

anti_join(stop_words) removes common English stop words. For the news corpus, a custom stop word list is added using bind_rows(tibble(word=c("trump"), lexicon="custom"), stop_words) to prevent named entities from skewing political sentiment scores.

inner_join(get_sentiments("bing"|"afinn"|"nrc"|"loughran")) matches each token against the chosen lexicon, keeping only words that appear in both. Words not in the lexicon are dropped, a key limitation of dictionary-based approaches on domain-specific text.

Part 1: aggregated by index = linenumber %/% 80. Part 2: aggregated by article_id. In both cases, pivot_wider() reshapes positive/negative counts into columns, and sentiment = positive − negative computes the net score.

Visualization uses ggplot2::geom_col() with facet_wrap(~method) to compare AFINN, Bing, NRC, and Loughran–McDonald side by side on the same political news articles. Top contributing words are extracted via count(word, sentiment, sort=TRUE) + slice_max(n, n=10).

All six novels produce coherent positive-leaning sentiment arcs across 80-line chunks, with clear narrative highs and lows. Structured narrative prose and stable vocabulary make lexicon matching reliable and results interpretable across all three lexicons (Bing, AFINN, NRC).
News articles produce highly variable sentiment with large swings between articles, even within the same category. Mixed tonal register (factual reporting + opinion), domain-specific vocabulary, and named entities (political figures) all reduce lexicon match rate and reliability compared to literary text.
| Lexicon | Scale | Behaviour on Political News | Key Observation |
|---|---|---|---|
| AFINN | −5 to +5 | Highest amplitude swings | Graded integer scores amplify both positives and negatives |
| Bing | Pos / Neg | Sharp but moderate swings | Clean binary signal; most widely applicable |
| NRC | 10 emotions | Similar pattern to Bing | Using positive/negative subsets mirrors Bing behaviour |
| Loughran–McDonald | Pos / Neg | Consistently more negative | Domain-specific negative vocabulary inflates negativity |
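One way to lay out such a side-by-side comparison is to stack the per-article scores from each method into a single tibble and facet on a `method` column. The sketch below uses placeholder scores purely to show the reshaping and plotting; the numbers are not project results, and the object names (`scores`, `comparison`, `p`) are hypothetical.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Placeholder per-article net scores, one tibble per lexicon.
# These numbers are made up to show the layout; they are not project results.
scores <- list(
  AFINN = tibble(article_id = 1:3, sentiment = c(-6, 4, -2)),
  Bing  = tibble(article_id = 1:3, sentiment = c(-2, 1, -1))
)

# Stack into long format; the list names become the `method` column
comparison <- bind_rows(scores, .id = "method")

# One panel per method, with a free y-axis because the scales differ
p <- ggplot(comparison, aes(article_id, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
```

Freeing the y-axis per panel is a design choice: AFINN sums and Bing net counts are on different scales, so a shared axis would flatten the binary methods.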
Political news consistently skews negative across all four lexicons. Top negative words include: crisis, threat, fail, attack. Named entity bias (political figures) required custom stop word removal to reduce distortion.
Sports articles produce the strongest positive sentiment signal. Words like win, champion, lead, achievement dominate. Results are consistent across Bing, AFINN, and NRC, suggesting sports vocabulary maps well onto general-purpose lexicons.
Technology articles show mild positive sentiment, with innovation-related words boosting scores. However, content about security, risk, and regulation pulls sentiment toward neutral or slightly negative in some articles.
Business articles show the greatest disagreement between lexicons — particularly between Loughran-McDonald (strongly negative due to financial terminology) and Bing/AFINN (closer to neutral), illustrating how domain matters for lexicon selection.
Reproduced the Text Mining with R workflow: unnest_tokens(), stop word removal,
lexicon joins, 80-line chunking, and sentiment visualization across all 6
Jane Austen novels and three lexicons.
Built a get_news(query) function using httr and jsonlite
to retrieve 100 full articles per category from /v2/everything.
Combined title, description, and content into a single unified text field for
richer sentiment signals than headline-only approaches.
Extended the stop_words list with a custom tibble
(tibble(word=c("trump"), lexicon="custom")) to remove political
named entities that distort sentiment scoring, demonstrating the importance
of corpus-aware preprocessing in NLP pipelines.
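A get_news(query) helper along the lines described above might look like the sketch below. It is not the project's code: the `NEWSAPI_KEY` environment variable, the `api_key` and `page_size` arguments, the `combine_text_fields()` helper, and the returned column names are all assumptions, while `q`, `pageSize`, and `apiKey` are query parameters of the NewsAPI /v2/everything endpoint.

```r
library(httr)
library(jsonlite)
library(tibble)

# Join title, description, and content row by row, skipping missing fields
combine_text_fields <- function(title, description, content) {
  fields <- cbind(title, description, content)
  apply(fields, 1, function(x) paste(x[!is.na(x)], collapse = " "))
}

# Fetch up to `page_size` articles for one query term.
# Assumes the API key is stored in the NEWSAPI_KEY environment variable.
get_news <- function(query, api_key = Sys.getenv("NEWSAPI_KEY"), page_size = 100) {
  resp <- GET(
    "https://newsapi.org/v2/everything",
    query = list(q = query, pageSize = page_size, apiKey = api_key)
  )
  stop_for_status(resp)  # fail loudly on HTTP errors such as 401 or 429

  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)
  articles <- parsed$articles

  tibble(
    category = query,
    text = combine_text_fields(articles$title, articles$description, articles$content)
  )
}
```

With a valid key, the four categories could then be collected in one call, for example `purrr::map_dfr(c("politics", "technology", "business", "sports"), get_news)`, and an `article_id` added with `row_number()` before tokenising.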
| Component | Details |
|---|---|
| Language | R · Quarto |
| Packages | tidytext · janeaustenr · dplyr · tidyr · stringr · ggplot2 · httr · jsonlite · purrr · tibble |
| Corpus 1 | Jane Austen — 6 novels via janeaustenr package |
| Corpus 2 | NewsAPI /v2/everything · politics, technology, business, sports · ~100 articles/category |
| Lexicons | Bing (Hu & Liu 2004) · AFINN (Nielsen 2011) · NRC (Mohammad & Turney 2013) · Loughran-McDonald (2011) |
| Chunking | 80-line sections (literary) · per-article aggregation (news) |
| Stop words | tidytext::stop_words + custom domain stop words (named entities) |
| Course | DATA 607 · Data Acquisition and Management · CUNY SPS |
Full Quarto report on RPubs · source code and data on GitHub.