R · tidyr · Plotly · JHU CSSE · CUNY DATA 607

COVID-19 Global Cases
Tidy Format Transformation & Interactive Visualization

Transforms the Johns Hopkins University CSSE wide-format global COVID-19 confirmed cases dataset into a tidy (long) structure using tidyr and dplyr. Daily cases are computed from cumulative counts via lag() differencing, missing values are handled systematically, and country-level summaries are visualized through an interactive Plotly choropleth map and a USA time series — illustrating how tidy data principles enable cleaner, more reproducible analysis.

R tidyr · dplyr pivot_longer Plotly ggplot2 JHU CSSE Choropleth CUNY DATA 607
Language
R · Quarto
Dataset
JHU CSSE Global COVID-19 — Jan 2020 → Mar 2023
Transform
Wide → Long via pivot_longer() · lag() differencing
Visualization
Plotly choropleth · ggplot2 time series
Course
DATA 607 · CUNY SPS · Project 2
01About

The Johns Hopkins CSSE COVID-19 dataset stores daily confirmed case counts in wide format — one column per date spanning from January 22, 2020 to March 10, 2023. This violates tidy data principles where every variable should have its own column and every observation its own row. With hundreds of date columns, standard analysis functions become impractical.

The core transformation converts this structure into a long format where each row represents a single country-date observation with a date variable and a confirmed_cases value — making the data compatible with the full tidyverse toolchain.

Two key challenges had to be addressed. First, date values stored as column names required parsing with mdy() after reshaping. Second, the dataset stores cumulative counts — not daily cases — requiring a lag() differencing step to extract the actual daily new case figures per country and province.

A third challenge: some countries reported negative daily values due to data corrections and revisions. These were clamped to zero with if_else(daily_cases < 0, 0L, daily_cases) to prevent downstream analysis errors while preserving data integrity.

02Pipeline

The complete data transformation follows a six-step tidy pipeline. Each step uses a single dplyr/tidyr verb, making the transformation reproducible, readable, and easy to audit. The pipeline runs entirely in-memory with no intermediate file writes.

01
Import wide-format CSV from GitHub
Read the JHU CSSE dataset directly from its GitHub raw URL using read_csv(). The dataset contains 289 country/province rows × 1,144 date columns. Columns 1–4 are metadata (Province/State, Country/Region, Lat, Long); columns 5 onward are date-named case count columns.
df <- read_csv(url, show_col_types = FALSE)
02
pivot_longer() — wide to tidy long format
All date columns (5 through last) are collapsed into two new columns: date (the former column name) and cumulative_cases (the value). This converts 289 × 1,144 wide format to ~330,000 rows × 6 columns — one row per country-province-date observation.
pivot_longer(cols = 5:last_col(), names_to = "date", values_to = "cumulative_cases")
03
Parse dates and clean column names
mdy(date) converts date strings like "1/22/20" into proper R Date objects. Column names are standardized, whitespace trimmed with str_trim(), and the dataset reordered by country, province, and date for the subsequent lag computation.
mutate(date = mdy(date), province = str_trim(`Province/State`))
04
Compute daily cases via lag() differencing
Since the source data is cumulative, daily new cases are extracted by subtracting the previous day's value within each country-province group: confirmed_cases = cumulative_cases - lag(cumulative_cases, default = 0). Grouped by (country, province) so the subtraction never crosses country boundaries.
group_by(country, province) %>% mutate(confirmed_cases = cumulative_cases - lag(cumulative_cases, default = 0))
05
Handle negatives and missing values
A second lag pass computes daily_cases and clamps negatives to 0 — eliminating artifacts from JHU data corrections and retroactive deduplication. Missing province values replaced with "None"; missing counts replaced with 0.
mutate(daily_cases = if_else(daily_cases < 0, 0L, daily_cases))
06
Aggregate country-level summaries
group_by(country) %>% summarise() computes three metrics per country: total_cases (sum), max_daily_cases (peak single day), and avg_daily_cases (mean over reporting period). ISO-3 country codes added via countrycode() for Plotly choropleth mapping.
summarise(total_cases = sum(...), avg_daily_cases = round(mean(...)))
03Live Dashboard
Live · Interactive · Plotly.js

Reproduced from the original Quarto report using Plotly.js — the same choropleth and time series generated by the R code. The global map uses the YlOrRd color scale (yellow → orange → red) to show average daily case burden across countries. Hover over any country to see its name, average daily cases, and single-day peak. The USA time series shows all five pandemic waves from Jan 2020 to Mar 2023.

Global COVID-19 Average Daily Cases by Country Jan 2020 – Mar 2023 · JHU CSSE · YlOrRd color scale
Daily COVID-19 Confirmed Cases — United States Jan 2020 – Mar 2023 · JHU CSSE · 5 distinct pandemic waves
04Key Results
Countries Tracked
190+
Nations in JHU dataset
Days of Data
1,143
Jan 22, 2020 – Mar 10, 2023
USA Peak Day
1.02M
Jan 10, 2022 · Omicron wave
Global Total
676M+
Confirmed cases worldwide
  • 01 Wide → Long transformation yields 330,000+ rows — the 289-row × 1,144-column wide dataset pivots into ~330,000 tidy observations, each representing a unique country-province-date combination with a single confirmed_cases value.
  • 02 US dominates global case counts — with ~103M total confirmed cases, the US accounts for the highest per-country total, followed by India (~44.7M), France (~38.9M), Germany (~38.4M), and Brazil (~37.5M). The choropleth reveals that high-income countries in North America and Western Europe show the heaviest case burdens.
  • 03 Five distinct USA pandemic waves — the time series clearly shows: Wave 1 (spring 2020, ~70K peak), Wave 2 (summer 2020), Wave 3 (winter 2020-21, ~300K), Wave 4 (Delta, fall 2021, ~170K), and Wave 5 (Omicron, Jan 2022, ~1.02M peak) — each with different magnitude and duration.
  • 04 Negative daily values require correction — approximately 2–3% of daily case values computed via lag() were negative due to JHU retroactive data corrections. Clamping these to zero is essential to prevent spurious negative totals in aggregations.
  • 05 Tidy format unlocks the full tidyverse toolchain — once in long format, standard group_by() + summarise(), filter(), and ggplot2 pipelines work directly on the data without any further reshaping — a key demonstration of why tidy data principles matter for analytical workflows.
05Contributions
LanguageR · Quarto
Packagestidyverse · tidyr · dplyr · lubridate · ggplot2 · plotly · countrycode · gt · scales
DatasetJohns Hopkins CSSE · time_series_covid19_confirmed_global.csv · Jan 2020–Mar 2023
Transformpivot_longer() · mdy() · lag() differencing · group_by() + summarise()
MapPlotly choropleth · ISO-3 codes via countrycode() · YlOrRd color scale
Time Seriesggplot2 geom_line() · scale_y_continuous(labels = comma) · theme_minimal()
CourseDATA 607 · Data Acquisition and Management · CUNY SPS · Project 2

Read the full analysis

Full Quarto report on RPubs · source code and dataset on GitHub.