Jailbreaking refers to techniques that manipulate an LLM into bypassing its safety guidelines and producing content it was trained to refuse. These attacks exploit the gap between how a model is intended to behave and how it has actually learned to behave, for example by reframing a disallowed request as fiction or role-play so the model no longer recognizes it as something to refuse.
Understanding jailbreaks is crucial for building robust AI safety measures and improving model alignment.