Available · Black Hat USA 2025 · 2025-08 · 50 min

Automated Red Teaming of AI Workloads

A deep dive into adversarial generation frameworks used to bypass safety filters on modern foundation models.

Slides (forthcoming) · Recording (forthcoming)

This talk came from a year of running adversarial generation at scale against production AI workloads. The central finding: the safety research community and the deployment engineering community are solving different problems, and the gap between them is where most production bypasses live.

Adversarial generation vs enumeration

Most red-team work against LLMs is enumeration: try known jailbreak patterns and record which ones work. Adversarial generation is different. A second model automatically produces variations on working bypasses and optimises each variation against the target's refusal signals. The question shifts from 'can a human find a bypass' to 'can a machine search the prompt space far more broadly than a human can'.

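```python
# Simplified adversarial generation loop.
# attacker_model, target_model, threshold and the seed values are placeholders;
# max_attempts is an added, hypothetical budget so the loop always terminates.
attempts = refusals = 0
refusal_rate = 1.0  # assume full refusal until measured
while refusal_rate > threshold and attempts < max_attempts:
    candidate = attacker_model.generate(
        base_prompt=working_bypass,
        mutation_strategy=current_strategy,
        avoid_patterns=known_refusal_triggers,
    )
    result = target_model.evaluate(candidate)
    attempts += 1
    if result.bypassed:
        # Keep the successful variant as the seed for the next mutation
        working_bypass = candidate
        bypass_history.append(candidate)
    else:
        refusals += 1
    # Recompute the rate so the loop condition reflects observed results
    refusal_rate = refusals / attempts
```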

What the framework found

Across eight production deployments tested during the research phase, the framework found a working bypass in every one. Time-to-first-bypass ranged from under a second (no RLHF hardening) to around 40 minutes (active content filtering, hardened RLHF). No deployment was immune to sustained automated attack.

WARNING

The implication for deployers: your safety posture needs to assume bypasses will be found, and your defence-in-depth must operate downstream of the model. Logging, output filtering, and capability restriction are not alternatives to model hardening. They are the layer that catches what model hardening misses.
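The three downstream layers named above can be sketched as a single post-model guard. This is a minimal illustration, not the framework from the talk: `ModelResponse`, `guard_response`, `BLOCKED_OUTPUT_PATTERNS`, and `ALLOWED_TOOLS` are hypothetical names, and the pattern and tool list are placeholder policy values.

```python
import logging
import re
from dataclasses import dataclass, field

logger = logging.getLogger("llm_guard")

# Hypothetical policy values, chosen only for illustration.
BLOCKED_OUTPUT_PATTERNS = [re.compile(r"(?i)BEGIN PRIVATE KEY")]
ALLOWED_TOOLS = {"search_docs", "summarise"}


@dataclass
class ModelResponse:
    """Assumed shape of a model reply: free text plus requested tool names."""
    text: str
    tool_calls: list[str] = field(default_factory=list)


def guard_response(prompt: str, response: ModelResponse) -> ModelResponse:
    """Apply logging, capability restriction, and output filtering after the model."""
    # Logging: record every exchange so a bypass can be investigated later.
    logger.info(
        "prompt_len=%d response_len=%d tools=%s",
        len(prompt), len(response.text), response.tool_calls,
    )

    # Capability restriction: drop any tool call that is not on the allow-list.
    permitted = [t for t in response.tool_calls if t in ALLOWED_TOOLS]
    if len(permitted) != len(response.tool_calls):
        dropped = sorted(set(response.tool_calls) - ALLOWED_TOOLS)
        logger.warning("dropped disallowed tool calls: %s", dropped)

    # Output filtering: withhold text that matches a blocked pattern.
    text = response.text
    if any(p.search(text) for p in BLOCKED_OUTPUT_PATTERNS):
        logger.warning("output matched a blocked pattern and was withheld")
        text = "[response withheld by output filter]"

    return ModelResponse(text=text, tool_calls=permitted)
```

The guard never asks whether the model was tricked. It only constrains what a tricked model can emit or do, which is why it still helps when hardening fails.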

What this means for deployment

Treat LLM safety as a probability distribution, not a binary control. A bypass that requires 40 minutes of sustained automated attack represents a fundamentally different threat model from one achievable by hand in five minutes. Your risk posture should reflect that distinction: not every bypass class warrants the same mitigation priority.
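One way to make the probabilistic framing concrete is to turn a measured per-attempt bypass rate into a cumulative probability over an attacker's attempt budget. The sketch below assumes attempts are independent, which understates an adaptive attacker who learns from refusals. The function names, both budgets, and the 0.5 cut-off are hypothetical values for illustration, not figures from the research.

```python
def cumulative_bypass_probability(per_attempt_rate: float, attempts: int) -> float:
    """P(at least one bypass in n independent attempts) = 1 - (1 - p)^n."""
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


def mitigation_priority(
    per_attempt_rate: float,
    manual_budget: int = 50,         # hypothetical: attempts a person makes by hand
    automated_budget: int = 50_000,  # hypothetical: attempts in a sustained automated run
    cutoff: float = 0.5,             # hypothetical: "more likely than not" threshold
) -> str:
    """Rank a bypass class by the cheapest attacker likely to reach it."""
    if cumulative_bypass_probability(per_attempt_rate, manual_budget) >= cutoff:
        return "high"    # reachable by hand
    if cumulative_bypass_probability(per_attempt_rate, automated_budget) >= cutoff:
        return "medium"  # needs sustained automation
    return "low"         # unlikely even under the automated budget


if __name__ == "__main__":
    for rate in (0.05, 0.0001, 1e-7):
        print(rate, mitigation_priority(rate))
```

The same bypass class can move between tiers as attacker budgets change, so the budgets are worth revisiting whenever automated tooling gets cheaper.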
