Benchmarks
Ethical benchmarks for evaluating AI systems.
FHIBE
Fairness/VisionFair Human-centric Image Benchmark for Evaluation - A globally diverse dataset for evaluating fairness in computer visio...
TrustLLM
Comprehensive SafetyA comprehensive framework for evaluating the trustworthiness of LLMs across six dimensions: truthfulness, safety, fairne...
TruthfulQA
TruthfulnessA benchmark to measure whether a language model is truthful in generating answers to questions, specifically targeting q...
BBQ (Bias Benchmark for QA)
Fairness/BiasA dataset of question sets designed to highlight social biases against people belonging to protected classes along nine ...
ETHICS Dataset
Moral JudgmentA benchmark for evaluating AI systems on their ability to make ethical judgments across multiple domains including justi...
SCRUPLES
Moral DilemmasA corpus of ethical judgments over real-life anecdotes, designed to evaluate whether AI can predict and explain human et...
RealToxicityPrompts
ToxicityA benchmark for evaluating the risk of neural language model degeneration into toxic language when given varying prompts...
Moral Machine
Autonomous VehiclesA platform for gathering human perspectives on moral decisions made by machine intelligence, focusing on autonomous vehi...