RLHF (Reinforcement Learning from Human Feedback) is a training technique that fine-tunes language models using human preferences as the reward signal. Humans compare pairs of model outputs and indicate which is better; these comparisons are used to train a reward model, which then guides the LLM toward more helpful, harmless, and honest responses.
RLHF is what transforms a pre-trained base model into an assistant that follows instructions and aligns with human values.
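
To make the reward-model step concrete, here is a minimal sketch of training on pairwise preferences with the Bradley-Terry objective commonly used in RLHF. It assumes PyTorch and pre-computed fixed-size embeddings of each response; the names `RewardModel`, `preference_loss`, and `EMBED_DIM` are illustrative, not from any particular library.

```python
import torch
import torch.nn as nn

EMBED_DIM = 128  # hypothetical embedding size for this sketch


class RewardModel(nn.Module):
    """Maps a response embedding to a scalar reward."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Return shape (batch,): one scalar reward per response.
        return self.score(x).squeeze(-1)


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: maximize the probability that the
    # human-preferred response scores higher than the rejected one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()


model = RewardModel(EMBED_DIM)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stand-in batch: embeddings of the preferred and dispreferred responses
# from human comparisons (random tensors here, real embeddings in practice).
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the reward model usually shares the LLM's backbone and scores full prompt-response pairs rather than frozen embeddings; the embeddings here just keep the sketch self-contained. The trained reward model's scores then serve as the reward signal for a policy-gradient step (e.g., PPO) that updates the language model itself.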