Transforms the Johns Hopkins University CSSE wide-format global COVID-19 confirmed
cases dataset into a tidy (long) structure using tidyr
and dplyr. Daily cases are computed from cumulative counts via
lag() differencing, missing values are handled systematically,
and country-level summaries are visualized through an
interactive Plotly choropleth map and a USA time series —
illustrating how tidy data principles enable cleaner, more reproducible analysis.
The Johns Hopkins CSSE COVID-19 dataset stores daily confirmed case counts in wide format — one column per date spanning from January 22, 2020 to March 10, 2023. This violates tidy data principles where every variable should have its own column and every observation its own row. With hundreds of date columns, standard analysis functions become impractical.
The core transformation converts this structure into a long format
where each row represents a single country-date observation with a
date variable and a confirmed_cases value — making the
data compatible with the full tidyverse toolchain.
Two key challenges had to be addressed. First, date values stored as column
names required parsing with mdy() after reshaping.
Second, the dataset stores cumulative counts — not daily cases —
requiring a lag() differencing step to extract the actual daily
new case figures per country and province.
A third challenge: some countries reported negative daily values
due to data corrections and revisions. These were clamped to zero with
if_else(daily_cases < 0, 0L, daily_cases) to prevent downstream
analysis errors while preserving data integrity.
The complete data transformation follows a six-step tidy pipeline. Each step
uses a single dplyr/tidyr verb, making the
transformation reproducible, readable, and easy to audit. The pipeline runs
entirely in-memory with no intermediate file writes.
read_csv(). The dataset contains 289 country/province rows × 1,144 date columns. Columns 1–4 are metadata (Province/State, Country/Region, Lat, Long); columns 5 onward are date-named case count columns.date (the former column name) and cumulative_cases (the value). This converts 289 × 1,144 wide format to ~330,000 rows × 6 columns — one row per country-province-date observation.mdy(date) converts date strings like "1/22/20" into proper R Date objects. Column names are standardized, whitespace trimmed with str_trim(), and the dataset reordered by country, province, and date for the subsequent lag computation.confirmed_cases = cumulative_cases - lag(cumulative_cases, default = 0). Grouped by (country, province) so the subtraction never crosses country boundaries.daily_cases and clamps negatives to 0 — eliminating artifacts from JHU data corrections and retroactive deduplication. Missing province values replaced with "None"; missing counts replaced with 0.group_by(country) %>% summarise() computes three metrics per country: total_cases (sum), max_daily_cases (peak single day), and avg_daily_cases (mean over reporting period). ISO-3 country codes added via countrycode() for Plotly choropleth mapping.Reproduced from the original Quarto report using Plotly.js — the same choropleth and time series generated by the R code. The global map uses the YlOrRd color scale (yellow → orange → red) to show average daily case burden across countries. Hover over any country to see its name, average daily cases, and single-day peak. The USA time series shows all five pandemic waves from Jan 2020 to Mar 2023.
confirmed_cases value.
lag() were negative due
to JHU retroactive data corrections. Clamping these to zero is essential to
prevent spurious negative totals in aggregations.
group_by() + summarise(),
filter(), and ggplot2 pipelines work directly
on the data without any further reshaping — a key demonstration of why
tidy data principles matter for analytical workflows.
pivot_longer(cols = 5:last_col()), parsed
1,143 date strings with mdy(), and computed daily cases from
cumulative counts using lag() differencing — all in a single
chained dplyr pipeline.
"None" replacement),
NA case counts (zero-fill), and negative daily values from retroactive JHU
corrections (clamping via if_else()) — all without modifying
the upstream source data.
plot_ly(type="choropleth") with ISO-3 country codes from
countrycode(), YlOrRd color scale, and custom hover text showing
country name, average daily cases, and peak daily cases per country.
total_cases, max_daily_cases, and
avg_daily_cases per country using grouped aggregation, formatted
with gt() for presentation including comma-separated numeric
formatting and column labels.
ggplot2 time series with scale_y_continuous(labels = comma)
that clearly distinguishes all five pandemic waves including the Omicron peak
of 1.02M daily cases on January 10, 2022.
| Language | R · Quarto |
| Packages | tidyverse · tidyr · dplyr · lubridate · ggplot2 · plotly · countrycode · gt · scales |
| Dataset | Johns Hopkins CSSE · time_series_covid19_confirmed_global.csv · Jan 2020–Mar 2023 |
| Transform | pivot_longer() · mdy() · lag() differencing · group_by() + summarise() |
| Map | Plotly choropleth · ISO-3 codes via countrycode() · YlOrRd color scale |
| Time Series | ggplot2 geom_line() · scale_y_continuous(labels = comma) · theme_minimal() |
| Course | DATA 607 · Data Acquisition and Management · CUNY SPS · Project 2 |
Full Quarto report on RPubs · source code and dataset on GitHub.