R · tidyr · Plotly · JHU CSSE · CUNY DATA 607

COVID-19 Global Cases —
Tidy Format Transformation & Interactive Visualization

Transforms the Johns Hopkins University CSSE wide-format global COVID-19 confirmed cases dataset into a tidy (long) structure using tidyr and dplyr. Daily cases are computed from cumulative counts via lag() differencing, missing values are handled systematically, and country-level summaries are visualized through an interactive Plotly choropleth map and a USA time series — illustrating how tidy data principles enable cleaner, more reproducible analysis.

R tidyr · dplyr pivot_longer Plotly ggplot2 JHU CSSE Choropleth CUNY DATA 607

Language

R · Quarto

Dataset

JHU CSSE Global COVID-19 — Jan 2020 → Mar 2023

Transform

Wide → Long via pivot_longer() · lag() differencing

Visualization

Plotly choropleth · ggplot2 time series

Course

DATA 607 · CUNY SPS · Project 2

Links

GitHub Repo ↗ RPubs Report ↗

01About

The Johns Hopkins CSSE COVID-19 dataset stores daily confirmed case counts in wide format — one column per date spanning from January 22, 2020 to March 10, 2023. This violates tidy data principles where every variable should have its own column and every observation its own row. With hundreds of date columns, standard analysis functions become impractical.

The core transformation converts this structure into a long format where each row represents a single country-date observation with a date variable and a confirmed_cases value — making the data compatible with the full tidyverse toolchain.

Two key challenges had to be addressed. First, date values stored as column names required parsing with mdy() after reshaping. Second, the dataset stores cumulative counts — not daily cases — requiring a lag() differencing step to extract the actual daily new case figures per country and province.

A third challenge: some countries reported negative daily values due to data corrections and revisions. These were clamped to zero with if_else(daily_cases < 0, 0L, daily_cases) to prevent downstream analysis errors while preserving data integrity.

02Pipeline

The complete data transformation follows a six-step tidy pipeline. Each step uses a single dplyr/tidyr verb, making the transformation reproducible, readable, and easy to audit. The pipeline runs entirely in-memory with no intermediate file writes.

Import wide-format CSV from GitHub

Read the JHU CSSE dataset directly from its GitHub raw URL using read_csv(). The dataset contains 289 country/province rows × 1,144 date columns. Columns 1–4 are metadata (Province/State, Country/Region, Lat, Long); columns 5 onward are date-named case count columns.

df <- read_csv(url, show_col_types = FALSE)

pivot_longer() — wide to tidy long format

All date columns (5 through last) are collapsed into two new columns: date (the former column name) and cumulative_cases (the value). This converts 289 × 1,144 wide format to ~330,000 rows × 6 columns — one row per country-province-date observation.

pivot_longer(cols = 5:last_col(), names_to = "date", values_to = "cumulative_cases")

Parse dates and clean column names

mdy(date) converts date strings like "1/22/20" into proper R Date objects. Column names are standardized, whitespace trimmed with str_trim(), and the dataset reordered by country, province, and date for the subsequent lag computation.

mutate(date = mdy(date), province = str_trim(`Province/State`))

Compute daily cases via lag() differencing

Since the source data is cumulative, daily new cases are extracted by subtracting the previous day's value within each country-province group: confirmed_cases = cumulative_cases - lag(cumulative_cases, default = 0). Grouped by (country, province) so the subtraction never crosses country boundaries.

group_by(country, province) %>% mutate(confirmed_cases = cumulative_cases - lag(cumulative_cases, default = 0))

Handle negatives and missing values

A second lag pass computes daily_cases and clamps negatives to 0 — eliminating artifacts from JHU data corrections and retroactive deduplication. Missing province values replaced with "None"; missing counts replaced with 0.

mutate(daily_cases = if_else(daily_cases < 0, 0L, daily_cases))

Aggregate country-level summaries

group_by(country) %>% summarise() computes three metrics per country: total_cases (sum), max_daily_cases (peak single day), and avg_daily_cases (mean over reporting period). ISO-3 country codes added via countrycode() for Plotly choropleth mapping.

summarise(total_cases = sum(...), avg_daily_cases = round(mean(...)))

03Live Dashboard

Live · Interactive · Plotly.js

Reproduced from the original Quarto report using Plotly.js — the same choropleth and time series generated by the R code. The global map uses the YlOrRd color scale (yellow → orange → red) to show average daily case burden across countries. Hover over any country to see its name, average daily cases, and single-day peak. The USA time series shows all five pandemic waves from Jan 2020 to Mar 2023.

Global COVID-19 Average Daily Cases by Country Jan 2020 – Mar 2023 · JHU CSSE · YlOrRd color scale

Daily COVID-19 Confirmed Cases — United States Jan 2020 – Mar 2023 · JHU CSSE · 5 distinct pandemic waves

04Key Results

Countries Tracked

190+

Nations in JHU dataset

Days of Data

1,143

Jan 22, 2020 – Mar 10, 2023

USA Peak Day

1.02M

Jan 10, 2022 · Omicron wave

Global Total

676M+

Confirmed cases worldwide

01 Wide → Long transformation yields 330,000+ rows — the 289-row × 1,144-column wide dataset pivots into ~330,000 tidy observations, each representing a unique country-province-date combination with a single confirmed_cases value.
02 US dominates global case counts — with ~103M total confirmed cases, the US accounts for the highest per-country total, followed by India (~44.7M), France (~38.9M), Germany (~38.4M), and Brazil (~37.5M). The choropleth reveals that high-income countries in North America and Western Europe show the heaviest case burdens.
03 Five distinct USA pandemic waves — the time series clearly shows: Wave 1 (spring 2020, ~70K peak), Wave 2 (summer 2020), Wave 3 (winter 2020-21, ~300K), Wave 4 (Delta, fall 2021, ~170K), and Wave 5 (Omicron, Jan 2022, ~1.02M peak) — each with different magnitude and duration.
04 Negative daily values require correction — approximately 2–3% of daily case values computed via lag() were negative due to JHU retroactive data corrections. Clamping these to zero is essential to prevent spurious negative totals in aggregations.
05 Tidy format unlocks the full tidyverse toolchain — once in long format, standard group_by() + summarise(), filter(), and ggplot2 pipelines work directly on the data without any further reshaping — a key demonstration of why tidy data principles matter for analytical workflows.

05Contributions

01 End-to-end tidy transformation pipeline — implemented the full wide-to-long pivot using pivot_longer(cols = 5:last_col()), parsed 1,143 date strings with mdy(), and computed daily cases from cumulative counts using lag() differencing — all in a single chained dplyr pipeline.
02 Robust data quality handling — addressed three distinct data quality issues: missing province values ("None" replacement), NA case counts (zero-fill), and negative daily values from retroactive JHU corrections (clamping via if_else()) — all without modifying the upstream source data.
03 Interactive Plotly choropleth map — created a world choropleth using plot_ly(type="choropleth") with ISO-3 country codes from countrycode(), YlOrRd color scale, and custom hover text showing country name, average daily cases, and peak daily cases per country.
04 Country-level summary statistics — computed total_cases, max_daily_cases, and avg_daily_cases per country using grouped aggregation, formatted with gt() for presentation including comma-separated numeric formatting and column labels.
05 USA pandemic wave analysis — filtered for the US, aggregated across all provinces/states by date, and plotted a ggplot2 time series with scale_y_continuous(labels = comma) that clearly distinguishes all five pandemic waves including the Omicron peak of 1.02M daily cases on January 10, 2022.

Language	R · Quarto
Packages	tidyverse · tidyr · dplyr · lubridate · ggplot2 · plotly · countrycode · gt · scales
Dataset	Johns Hopkins CSSE · time_series_covid19_confirmed_global.csv · Jan 2020–Mar 2023
Transform	pivot_longer() · mdy() · lag() differencing · group_by() + summarise()
Map	Plotly choropleth · ISO-3 codes via countrycode() · YlOrRd color scale
Time Series	ggplot2 geom_line() · scale_y_continuous(labels = comma) · theme_minimal()
Course	DATA 607 · Data Acquisition and Management · CUNY SPS · Project 2

Read the full analysis

Full Quarto report on RPubs · source code and dataset on GitHub.

GitHub Repo ↗ RPubs Report ↗ All Projects

COVID-19 Global Cases — Tidy Format Transformation & Interactive Visualization

Read the full analysis

COVID-19 Global Cases —
Tidy Format Transformation & Interactive Visualization