Benchmarks

Benchmarks for evaluating the ethical behavior of AI systems.

FHIBE

Fairness/Vision

Fair Human-centric Image Benchmark for Evaluation - A globally diverse dataset for evaluating fairness in computer visio...

2025 10,318 images from 81 countries

TrustLLM

Comprehensive Safety

A comprehensive framework for evaluating the trustworthiness of LLMs across six dimensions: truthfulness, safety, fairne...

2024 30+ datasets

TruthfulQA

Truthfulness

A benchmark to measure whether a language model is truthful in generating answers to questions, specifically targeting q...

2022 817 questions

BBQ (Bias Benchmark for QA)

Fairness/Bias

A dataset of question sets designed to highlight social biases against people belonging to protected classes along nine ...

2022 58,492 examples

ETHICS Dataset

Moral Judgment

A benchmark for evaluating AI systems on their ability to make ethical judgments across multiple domains including justi...

2021 130K samples

SCRUPLES

Moral Dilemmas

A corpus of ethical judgments over real-life anecdotes, designed to evaluate whether AI can predict and explain human et...

2020 625K ethical judgments

RealToxicityPrompts

Toxicity

A benchmark for evaluating the risk of neural language model degeneration into toxic language when given varying prompts...

2020 100K prompts

Moral Machine

Autonomous Vehicles

A platform for gathering human perspectives on moral decisions made by machine intelligence, focusing on autonomous vehi...

2018 40M decisions from 233 countries and territories