EPI-Eval

A curated collection of large epidemiological datasets, normalized to a single schema so they can be searched, joined, and benchmarked against each other.

What we track

Time-series surveillance data on infectious disease: primarily respiratory viruses (flu, COVID-19, RSV) and arboviral diseases (dengue, Zika, chikungunya), with smaller coverage of notifiable-disease, mortality, wastewater, and behavioural / search signals. Sources include CDC, WHO, ECDC, PAHO, OWID, and national public-health agencies. We re-publish them as Parquet with a consistent set of row-level columns (date, location_id, location_level, and optional condition / case_status / as_of) plus a metadata header describing pathogens, geography, cadence, and per-column units.
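As a rough illustration of the row-level contract described above, the sketch below splits a dataset's column list into required, optional, and remaining value columns. The column names come from the paragraph above; the helper name `check_row_columns` and the sample column `ed_visits` are hypothetical and are not part of the published pipeline.

```python
# Column names are taken from the description above. The helper name and
# the sample column list are illustrative assumptions, not pipeline code.
REQUIRED_COLUMNS = ("date", "location_id", "location_level")
OPTIONAL_COLUMNS = ("condition", "case_status", "as_of")


def check_row_columns(columns):
    """Return (missing required, optional present, remaining value columns)."""
    present = set(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in present]
    optional = [c for c in OPTIONAL_COLUMNS if c in present]
    known = set(REQUIRED_COLUMNS) | set(OPTIONAL_COLUMNS)
    values = [c for c in columns if c not in known]
    return missing, optional, values


# "ed_visits" is a made-up value column used only for this example.
example = ["date", "location_id", "location_level", "as_of", "ed_visits"]
missing, optional, values = check_row_columns(example)
```

If pyarrow is available, the column list could be read without loading the data via `pyarrow.parquet.read_schema(path).names`.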

Why

Forecasting and modeling work routinely stalls on data plumbing: finding the canonical version of a series, normalizing geography codes, reconciling reporting cadences, and tracking when a source was last revised. The goal of this org is to do that work once, in the open.

Schema

Every dataset card on this org uses the same frontmatter format (schema v0.1), validated against a controlled vocabulary (vocabularies.yaml). Curated metadata (pathogens, license, units) lives alongside computed metadata (time coverage, row count, observed cadence) generated at ingest.
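A minimal sketch of what validating curated metadata against a controlled vocabulary might look like. The `VOCABULARY` dict is a hypothetical stand-in for vocabularies.yaml, populated only with terms that appear in the dataset listing in this README; the real file's keys and terms may differ, and `validate_card` is an illustrative name rather than the actual validator.

```python
# Hypothetical stand-in for vocabularies.yaml. The real file's structure
# and terms are not shown in this README and may differ.
VOCABULARY = {
    "pathogens": {"influenza", "sars-cov-2", "rsv", "dengue", "mpox", "tuberculosis"},
    "cadence": {"daily", "weekly", "annual", "irregular"},
}


def validate_card(card, vocabulary=VOCABULARY):
    """Return a list of human-readable errors; an empty list means the card fits."""
    errors = []
    for field, allowed in vocabulary.items():
        if field not in card:
            errors.append(f"missing field: {field}")
            continue
        raw = card[field]
        # Accept either a single term or a list of terms.
        terms = raw if isinstance(raw, list) else [raw]
        for term in terms:
            if term not in allowed:
                errors.append(f"{field}: unknown term {term!r}")
    return errors


good_card = {"pathogens": ["influenza", "rsv"], "cadence": "weekly"}
bad_card = {"pathogens": ["flu"], "cadence": "weekly"}
```

Collecting every error instead of stopping at the first one lets a contributor fix a card in a single pass.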

Contributing a dataset

The ingest pipeline is in apart-forecasting-tool/upload_pipeline. A new dataset consists of one ingest.py and one card.yaml under upload_pipeline/sources/<source_id>/; the validator confirms schema fit before upload. Each new truth dataset automatically gets an empty <id>-predictions companion at upload time.
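Before opening a PR, a contributor could check the source directory layout locally with something like the sketch below. Both function names are hypothetical, and the real validator does more than this (it confirms schema fit); the companion repo naming follows the EPI-Eval/<id>-predictions pattern described in the Predictions section.

```python
from pathlib import Path

# The two files a source directory is expected to contain, per the text above.
EXPECTED_FILES = ("ingest.py", "card.yaml")


def missing_source_files(sources_root, source_id):
    """List expected files that are absent from sources_root/<source_id>/."""
    source_dir = Path(sources_root) / source_id
    return [name for name in EXPECTED_FILES if not (source_dir / name).is_file()]


def predictions_repo_id(dataset_id, org="EPI-Eval"):
    """Name of the companion predictions repo for a truth dataset."""
    return f"{org}/{dataset_id}-predictions"
```

For example, `missing_source_files("upload_pipeline/sources", "my_source")` returns an empty list once both files exist, where `my_source` is a placeholder source ID.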

Datasets (21)

Respiratory

Syndromic / ED

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| CDC NSSP / ESSENCE — ED visits for ILI / COVID / RSV | influenza, sars-cov-2, rsv | US | weekly |

Arboviral

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| OpenDengue — national dengue case counts (V1.3) | dengue | multiple | irregular |

Mobility & contact

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| Google Community Mobility Reports — global daily | | multiple | daily |

Search & behavioural

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| Wikipedia pageviews — disease-article daily views | influenza, sars-cov-2, rsv +6 | multiple | daily |

Notifiable / other

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| OWID Mpox — global daily compiled | mpox | multiple | daily |
| WHO Global TB — annual country estimates | tuberculosis | multiple | annual |

Predictions

Each truth dataset has a companion EPI-Eval/<id>-predictions repo that accumulates community-submitted forecasts. The schema is long-format: one row per (target_date, [dim values…], quantile, value), with quantile = NULL reserved for the point estimate. Forecasters submit through the EPI-Eval dashboard, and a maintainer reviews each PR before merging. Merged predictions appear under the Show predictions toggle on the corresponding truth dataset in the dashboard, alongside a per-submitter leaderboard (MAE / WIS / rWIS / coverage).
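To make the long format concrete, the sketch below builds the rows for one target and scores them against an observed value. It is not the dashboard's scoring code: the dimension column `location_id`, the numbers, and the function names are all illustrative assumptions. WIS is computed as twice the mean pinball loss, which matches the standard weighted interval score only when the submitted quantile levels are symmetric around, and include, the median.

```python
def pinball_loss(observed, predicted, level):
    """Quantile (pinball) loss of one predicted quantile at the given level."""
    if observed >= predicted:
        return (observed - predicted) * level
    return (predicted - observed) * (1.0 - level)


def score_forecast(rows, observed):
    """Score one target's long-format rows against the observed value."""
    points = [r["value"] for r in rows if r["quantile"] is None]
    quantiles = {r["quantile"]: r["value"] for r in rows if r["quantile"] is not None}

    # Absolute error of the point estimate; averaging it over targets gives MAE.
    abs_error = abs(points[0] - observed) if points else None

    # WIS as twice the mean pinball loss (assumes symmetric levels with a median).
    wis = None
    if quantiles:
        losses = [pinball_loss(observed, v, q) for q, v in quantiles.items()]
        wis = 2.0 * sum(losses) / len(losses)

    # Whether the central 95% interval contains the observation.
    covered_95 = None
    if 0.025 in quantiles and 0.975 in quantiles:
        covered_95 = quantiles[0.025] <= observed <= quantiles[0.975]

    return {"abs_error": abs_error, "wis": wis, "covered_95": covered_95}


# Made-up forecast for a single (target_date, location_id) pair.
key = {"target_date": "2025-01-04", "location_id": "US"}
rows = [
    {**key, "quantile": None, "value": 112.0},  # point estimate
    {**key, "quantile": 0.025, "value": 80.0},
    {**key, "quantile": 0.25, "value": 100.0},
    {**key, "quantile": 0.5, "value": 110.0},
    {**key, "quantile": 0.75, "value": 130.0},
    {**key, "quantile": 0.975, "value": 170.0},
]
result = score_forecast(rows, observed=120.0)
```

Keeping the point estimate in its own NULL-quantile row lets a submitter report a mean or mode that differs from the 0.5 quantile without changing the schema.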

Status

Active. Coverage and dataset list grow through PRs to the upload pipeline.