Human evaluation involves having people assess AI-generated outputs for quality, accuracy, helpfulness, and other criteria. While slower and more expensive than automated metrics, human evaluation captures nuances that algorithms miss and serves as the ground truth for what "good" output looks like.
Human evaluation is essential both for validating automated metrics, by checking that metric scores actually track human judgments, and for assessing subjective qualities, such as tone, coherence, and creativity, that automated metrics struggle to quantify.
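Before human labels can serve as ground truth, the raters themselves need to agree with each other. One common check (an illustrative sketch, not something the text above prescribes; the function name and sample labels are hypothetical) is Cohen's kappa, which measures agreement between two annotators corrected for agreement expected by chance:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Inter-annotator agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b), "raters must label the same items"
    n = len(rater_a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two annotators rate the same five AI outputs as "good" or "bad".
a = ["good", "good", "bad", "good", "bad"]
b = ["good", "bad", "bad", "good", "bad"]
print(round(cohens_kappa(a, b), 2))  # → 0.62
```

A kappa near 1 means the raters agree far beyond chance; values near 0 suggest the rating criteria are too vague for the labels to serve as reliable ground truth.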