CR Innovation Initiative

Data Nutrition Project

Helping data scientists understand what's inside datasets before they're used in machine learning.


Artificial intelligence (AI) systems built on incomplete or biased data will often exhibit problematic outcomes, leading to negative unintended consequences that affect the very communities that are already marginalized, underserved, or underrepresented. And yet there are few, if any, standard methods of data analysis to check for the ‘health’ of data, particularly before model development.

Our Approach

With the aim of mitigating harms caused by automated decision-making systems, the Dataset Nutrition Label tool enhances the context, content, and legibility of datasets. Drawing on the analogy of the Nutrition Facts Label on food, the Label highlights the “ingredients” of a dataset to help shed light on how (or whether) the dataset is healthy for use.

Led by Innovation Lab Fellow Kasia Chmielinski, the Dataset Nutrition Label is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset “ingredients” before AI model development. The framework is optimized for the data practitioner journey: it pairs potential use cases for the data with alerts or flags that highlight known issues and possible mitigation strategies.

The Label, now in its third generation and in use across a range of domains including healthcare and humanitarian use cases, is intended to drive robust data analysis practices by making it easier and faster for data scientists to interrogate and select datasets; increase the overall quality of models by driving the use of better and more appropriate datasets for those models; and enable the creation and publishing of responsible datasets by those who collect, clean and publish data.

At-a-glance information

The web-based label includes four distinct panes of information: About the Label (top bar), Metadata (left side), Use Case Matrix (top right), and Inference Risks (bottom right).

TaxBills NYC Dataset
The Label provides overall dataset information including intended and high risk uses, known risks, and quick information (badges) for key questions such as whether the data has undergone ethical review or includes data from human subjects.
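
For readers who think in code, the four panes described above can be pictured as a simple record. The Python sketch below is purely illustrative: the class names, field names, and the "intended" / "high risk" ratings are assumptions made for this example, not the Data Nutrition Project's actual schema or API, and the values are placeholders rather than data from any real Label.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InferenceRisk:
    """One entry in the Inference Risks pane (hypothetical structure)."""
    description: str        # a known issue with the data
    mitigation: str = ""    # a possible mitigation strategy, if any


@dataclass
class DatasetLabel:
    """Illustrative model of a Label; field names are assumptions."""
    about: str                                               # About the Label (top bar)
    metadata: Dict[str, str] = field(default_factory=dict)   # Metadata (left side)
    use_cases: Dict[str, str] = field(default_factory=dict)  # Use Case Matrix: use -> rating
    inference_risks: List[InferenceRisk] = field(default_factory=list)
    badges: Dict[str, bool] = field(default_factory=dict)    # quick yes/no answers

    def high_risk_uses(self) -> List[str]:
        """Return the use cases rated "high risk", sorted by name."""
        return sorted(u for u, r in self.use_cases.items() if r == "high risk")


# Placeholder values only; not taken from any published Label.
label = DatasetLabel(
    about="Example label",
    metadata={"name": "Example dataset"},
    use_cases={"use case A": "intended", "use case B": "high risk"},
    inference_risks=[InferenceRisk("placeholder risk", "placeholder mitigation")],
    badges={"ethical_review": True, "human_subjects": False},
)
print(label.high_risk_uses())
```

A practitioner screening datasets might check a field like `badges` or call a helper like `high_risk_uses()` before deciding whether a dataset suits a given model; the real Label presents this information visually in the web interface.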

Milestones Ahead

The Data Nutrition Project is a research organization and product development team composed of technologists, designers, academics and scientists. Together, we are excited to continue the work of driving better AI through the exploration and development of practical tools.

We plan to launch the third generation of the Dataset Nutrition Label and Label Maker Tool (beta) in early 2023, and will continue working on a number of initiatives this year and beyond:

  • Publish a Labels Library. Create additional Labels for high-impact datasets often used to train AI.
  • Conduct User Research. Work with data scientists to refine the utility and legibility of the Label.
  • Research the Landscape. Continue research on the broader ecosystem of labeling and algorithmic accountability, and make the findings broadly available to academics and policy makers.