Automated Red Teaming of AI Workloads
A deep dive into adversarial generation frameworks used to bypass safety filters on modern foundation models.
This talk came from a year of running adversarial generation at scale against production AI workloads. The central finding: the safety research community and the deployment engineering community are solving different problems, and the gap between them is where most production bypasses live.
Adversarial generation vs enumeration
Most red-team work against LLMs is enumeration: try known jailbreak patterns and record which ones work. Adversarial generation is different. You use a second model to automatically produce variations on working bypasses, optimising each variation against the target's refusal signals. This moves the question from 'can a human find a bypass' to 'can a machine search the prompt space systematically'. The search is guided rather than exhaustive, since the prompt space is far too large to sweep in full.
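```python
# Simplified adversarial generation loop.
# attacker_model, target_model, working_bypass, current_strategy,
# known_refusal_triggers and threshold come from the surrounding harness.
# max_attempts is an added safety bound so the loop always terminates.
bypass_history = []
attempts = refusals = 0
refusal_rate = 1.0  # treat the target as fully refusing until measured

while refusal_rate > threshold and attempts < max_attempts:
    candidate = attacker_model.generate(
        base_prompt=working_bypass,
        mutation_strategy=current_strategy,
        avoid_patterns=known_refusal_triggers,
    )
    result = target_model.evaluate(candidate)
    attempts += 1
    if result.bypassed:
        # Keep the successful variant as the seed for the next mutation.
        working_bypass = candidate
        bypass_history.append(candidate)
    else:
        refusals += 1
    refusal_rate = refusals / attempts  # update the value the loop tests
```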
What the framework found
Across eight production deployments tested during the research phase, the framework found a working bypass in every one. Time-to-first-bypass ranged from under a second (no RLHF hardening) to around 40 minutes (active content filtering, hardened RLHF). No deployment was immune to sustained automated attack.
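Time-to-first-bypass is the metric these results rest on, so it helps to pin down how it might be computed from an attempt log. The sketch below is a minimal illustration, not the framework's actual code: the `Attempt` record, its field names, and the sample log values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Attempt:
    """One logged attempt (hypothetical record shape)."""
    elapsed_seconds: float  # time since the run started
    bypassed: bool          # True if the target did not refuse


def time_to_first_bypass(attempts: Iterable[Attempt]) -> Optional[float]:
    """Return elapsed seconds at the earliest bypass, or None if there was none."""
    times = [a.elapsed_seconds for a in attempts if a.bypassed]
    return min(times) if times else None


# Illustrative log only; these numbers are made up for the example.
log = [
    Attempt(0.4, False),
    Attempt(1.1, False),
    Attempt(2.7, True),
    Attempt(3.0, True),
]
first = time_to_first_bypass(log)
```

Returning `None` rather than a sentinel number keeps "never bypassed within the run" distinct from "bypassed at time zero", which matters when comparing deployments.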
What this means for deployment
Treat LLM safety as a probability distribution, not a binary control. A bypass that requires 40 minutes of sustained automated attack represents a fundamentally different threat model than one achievable by hand in five minutes. Your risk posture should reflect that distinction - not every bypass class warrants the same mitigation priority.
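One way to make "a probability distribution, not a binary control" concrete is to ask how likely at least one bypass is within a given attack budget. The sketch below assumes every attempt succeeds independently with the same rate, which is a deliberate simplification: adaptive attacks learn from earlier attempts, so treat the output as a rough guide rather than a measurement. The function names and parameters are illustrative.

```python
import math
from typing import Optional


def bypass_probability(per_attempt_rate: float, attempts: int) -> float:
    """P(at least one bypass in `attempts` tries), assuming independent tries."""
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


def attempts_for_probability(per_attempt_rate: float, target: float) -> Optional[int]:
    """Smallest number of tries reaching `target` probability, or None if unreachable."""
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be in [0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if per_attempt_rate == 0.0:
        return None  # a zero success rate never reaches the target
    if per_attempt_rate == 1.0:
        return 1     # the first attempt always succeeds
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_attempt_rate))
```

Under this model, a deployment whose per-attempt bypass rate is low still approaches certainty as the attempt budget grows. That is why the attacker's required effort, not the mere existence of a bypass, is the quantity to prioritise on.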