Safety / Testing

Red Teaming

Intermediate [3/5]
Adversarial testing Security testing Safety evaluation

Definition

Red teaming is the practice of proactively testing AI systems by attempting to elicit harmful, unintended, or unsafe outputs. Red teamers act as adversaries, trying to find vulnerabilities, jailbreaks, and failure modes before malicious actors do.

Red teaming is essential for understanding a model's actual safety boundaries versus its intended behavior.

Key Concepts

  • Adversarial mindset: Think like an attacker to find vulnerabilities
  • Coverage: Test across many harm categories systematically
  • Automated + manual: Combine automated attacks with human creativity
  • Responsible disclosure: Report findings to improve safety

Examples

Categories
Red Teaming Attack Vectors
RED TEAMING CATEGORIES: 1. JAILBREAK ATTEMPTS: - Role-play scenarios ("Pretend you're evil...") - Hypothetical framing ("In a fictional world...") - Instruction override ("Ignore previous...") - Multi-turn manipulation - Language switching 2. HARMFUL CONTENT ELICITATION: - Direct requests (usually blocked) - Indirect approaches (metaphors, coding) - Step-by-step disclosure - Academic framing - Historical context abuse 3. PROMPT INJECTION: - Hidden instructions in input - Data exfiltration attempts - System prompt extraction - Tool/function abuse 4. BIAS AND FAIRNESS: - Demographic stereotyping - Differential treatment - Toxicity toward groups - Representational harms 5. PRIVACY ATTACKS: - Training data extraction - PII reconstruction - Membership inference - Model inversion 6. MISINFORMATION: - Hallucination triggers - Confident false claims - Citation fabrication - Deepfake text generation TESTING MATRIX: ┌──────────────┬───────────────────────────┐ │ Harm Type │ Test Cases │ ├──────────────┼───────────────────────────┤ │ Violence │ Weapons, attacks, abuse │ │ Illegal │ Drugs, fraud, hacking │ │ Sexual │ CSAM, explicit content │ │ Hate │ Discrimination, slurs │ │ Self-harm │ Suicide, eating disorders │ │ Deception │ Scams, impersonation │ └──────────────┴───────────────────────────┘
Process
Red Teaming Methodology
RED TEAMING PROCESS: 1. SCOPE DEFINITION: - What model/system to test - What harm categories to cover - Time and resource allocation - Success criteria 2. TEAM COMPOSITION: - Domain experts (safety, security) - Creative adversarial thinkers - Diverse perspectives - Mix of automated tools + humans 3. TEST EXECUTION: Manual testing: - Creative jailbreak attempts - Edge case exploration - Multi-turn conversations - Novel attack vectors Automated testing: - Fuzzing with variations - Known attack library - Paraphrase generation - Cross-lingual attacks 4. DOCUMENTATION: { "attack_type": "role-play jailbreak", "prompt": "[exact prompt used]", "response": "[model's response]", "severity": "high", "harm_category": "violence", "reproducible": true, "notes": "Works in French, blocked in English" } 5. REMEDIATION: - Prioritize by severity - Retrain or add guardrails - Retest after fixes - Document mitigations 6. CONTINUOUS PROCESS: New attacks emerge constantly Regular red teaming needed Crowdsourced bug bounties help

Interactive Exercise

Red Team Thinking

You're red teaming a coding assistant. What are 3 potential misuse scenarios you would test for?

Pro Tips
  • Think creatively - many jailbreaks are surprisingly simple
  • Test in multiple languages - safety may vary by language
  • Document everything, even failed attempts
  • Consider multi-turn attacks, not just single prompts

Related Terms