Objectives: Federal open data initiatives that promote increased sharing of
federally collected data are important for transparency, data quality, trust,
and relationships with the public and state, tribal, local, and territorial
(STLT) partners. These initiatives advance understanding of health conditions
and diseases by making data available to more researchers, scientists, and
policymakers for analysis, collaboration, and other valuable uses beyond CDC
responders. This is particularly true for emerging conditions such as COVID-19,
where much remains to be learned and data needs continue to evolve. Since the beginning
of the outbreak, CDC has collected person-level, de-identified data from
jurisdictions and currently holds more than 8 million records, with the number increasing daily.
This paper describes how CDC designed and produces two de-identified public
datasets from these collected data.
Materials and Methods: Data elements were selected for inclusion based on their
usefulness, public requests, and privacy implications; specific field values were
suppressed to reduce the risk of reidentification and exposure of confidential
information. Datasets were created and verified for privacy and confidentiality
using the analytic tools of a data management platform as well as R scripts.
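The suppression and verification steps described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R code or platform tooling the authors used. The field names (`age_group`, `sex`, `county`), the minimum cell size of 5, and the `"NA"` placeholder are all hypothetical assumptions, and the rule shown (blanking quasi-identifiers in any combination shared by fewer than k records) is one common approach, not necessarily the one CDC applied.

```python
from collections import Counter

# Hypothetical quasi-identifier fields and threshold; not taken from the paper.
QUASI_IDENTIFIERS = ("age_group", "sex", "county")
MIN_CELL_SIZE = 5
SUPPRESSED = "NA"


def suppress_small_cells(records, fields=QUASI_IDENTIFIERS, k=MIN_CELL_SIZE):
    """Return copies of the records with quasi-identifiers blanked out
    wherever their combination of values is shared by fewer than k records."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    cleaned = []
    for r in records:
        row = dict(r)  # copy so the input records are left unchanged
        if counts[tuple(r[f] for f in fields)] < k:
            for f in fields:
                row[f] = SUPPRESSED
        cleaned.append(row)
    return cleaned


def verify_cell_sizes(records, fields=QUASI_IDENTIFIERS, k=MIN_CELL_SIZE):
    """Check that every combination that is not fully suppressed
    appears in at least k records."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return all(
        n >= k
        for key, n in counts.items()
        if any(value != SUPPRESSED for value in key)
    )
```

Running `verify_cell_sizes` on the output of `suppress_small_cells` mirrors the create-then-verify sequence the abstract describes: the first function produces the dataset, and the second independently confirms that no rare combination of values remains.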
Results: Unrestricted data are available to the public through Data.CDC.gov,
and restricted data, which include additional fields, are available under a data
use agreement through a private repository on GitHub.com.
Practice Implications: A richer understanding of the available public data,
the methods used to create these data, and the algorithms used to protect
the privacy of de-identified individuals allows for improved data use. Automating
data generation procedures allows greater and more timely sharing of data.