Red Teaming | HyperKit.ai

Definition

Red teaming is the practice of proactively testing AI systems by attempting to elicit harmful, unintended, or unsafe outputs. Red teamers act as adversaries, trying to find vulnerabilities, jailbreaks, and failure modes before malicious actors do.

Red teaming is essential for understanding a model's actual safety boundaries versus its intended behavior.

Key Concepts

Adversarial mindset: Think like an attacker to find vulnerabilities
Coverage: Test across many harm categories systematically
Automated + manual: Combine automated attacks with human creativity
Responsible disclosure: Report findings to improve safety

Examples

Categories

Red Teaming Attack Vectors

RED TEAMING CATEGORIES:

1. JAILBREAK ATTEMPTS:
   - Role-play scenarios ("Pretend you're evil...")
   - Hypothetical framing ("In a fictional world...")
   - Instruction override ("Ignore previous...")
   - Multi-turn manipulation
   - Language switching

2. HARMFUL CONTENT ELICITATION:
   - Direct requests (usually blocked)
   - Indirect approaches (metaphors, coding)
   - Step-by-step disclosure
   - Academic framing
   - Historical context abuse

3. PROMPT INJECTION:
   - Hidden instructions in input
   - Data exfiltration attempts
   - System prompt extraction
   - Tool/function abuse

4. BIAS AND FAIRNESS:
   - Demographic stereotyping
   - Differential treatment
   - Toxicity toward groups
   - Representational harms

5. PRIVACY ATTACKS:
   - Training data extraction
   - PII reconstruction
   - Membership inference
   - Model inversion

6. MISINFORMATION:
   - Hallucination triggers
   - Confident false claims
   - Citation fabrication
   - Deepfake text generation

TESTING MATRIX:
┌──────────────┬───────────────────────────┐
│ Harm Type    │ Test Cases               │
├──────────────┼───────────────────────────┤
│ Violence     │ Weapons, attacks, abuse   │
│ Illegal      │ Drugs, fraud, hacking     │
│ Sexual       │ CSAM, explicit content    │
│ Hate         │ Discrimination, slurs     │
│ Self-harm    │ Suicide, eating disorders │
│ Deception    │ Scams, impersonation      │
└──────────────┴───────────────────────────┘

Process

Red Teaming Methodology

RED TEAMING PROCESS:

1. SCOPE DEFINITION:
   - What model/system to test
   - What harm categories to cover
   - Time and resource allocation
   - Success criteria

2. TEAM COMPOSITION:
   - Domain experts (safety, security)
   - Creative adversarial thinkers
   - Diverse perspectives
   - Mix of automated tools + humans

3. TEST EXECUTION:
   Manual testing:
   - Creative jailbreak attempts
   - Edge case exploration
   - Multi-turn conversations
   - Novel attack vectors

   Automated testing:
   - Fuzzing with variations
   - Known attack library
   - Paraphrase generation
   - Cross-lingual attacks

4. DOCUMENTATION:
   {
     "attack_type": "role-play jailbreak",
     "prompt": "[exact prompt used]",
     "response": "[model's response]",
     "severity": "high",
     "harm_category": "violence",
     "reproducible": true,
     "notes": "Works in French, blocked in English"
   }

5. REMEDIATION:
   - Prioritize by severity
   - Retrain or add guardrails
   - Retest after fixes
   - Document mitigations

6. CONTINUOUS PROCESS:
   New attacks emerge constantly
   Regular red teaming needed
   Crowdsourced bug bounties help

Interactive Exercise

✎

Red Team Thinking

You're red teaming a coding assistant. What are 3 potential misuse scenarios you would test for?

Pro Tips

Think creatively - many jailbreaks are surprisingly simple
Test in multiple languages - safety may vary by language
Document everything, even failed attempts
Consider multi-turn attacks, not just single prompts

Definition

Key Concepts

Examples

Interactive Exercise

Related Terms