Claude Code for Adversarial Robustness (2026)

Adversarial robustness has become a critical concern for machine learning practitioners deploying models in production. Malicious actors can craft input perturbations that cause well-trained models to misbehave catastrophically. Building a solid testing workflow helps you identify vulnerabilities before deployment and implement defensive measures that maintain model accuracy under attack.

Claude Code provides an excellent framework for automating adversarial robustness testing. This guide walks through building a comprehensive workflow that integrates perturbation generation, attack simulation, and defensive validation into your existing ML pipeline.

Understanding Adversarial Robustness

Adversarial robustness refers to a model’s ability to maintain correct predictions when faced with adversarial examples, inputs that have been specifically crafted to cause misclassification. These perturbations are often imperceptible to humans but can drastically alter model behavior.
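
In the standard formulation (an assumption this guide follows throughout), the attacker searches for a small perturbation, bounded in the L-infinity norm, that maximizes the model's loss on the true label:

```latex
% Untargeted adversarial example under an L-infinity threat model
x_{\mathrm{adv}} = x + \delta^{*}, \qquad
\delta^{*} = \arg\max_{\lVert \delta \rVert_{\infty} \le \epsilon}
\mathcal{L}\!\left(f_{\theta}(x + \delta),\, y\right)
```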

The fundamental challenge is that gradient-based models create decision boundaries that, while accurate on training data, can be exploited through carefully constructed perturbations. Testing for these vulnerabilities requires specialized tools and methodologies that differ from standard ML validation.

Traditional unit tests verify that models work correctly on expected inputs. Adversarial robustness testing goes further by systematically exploring the input space around each prediction to identify failure modes that attackers might exploit.

Setting Up Your Robustness Testing Environment

Begin by creating a dedicated skill for adversarial testing. This skill will encapsulate the tools and workflows needed to evaluate model robustness:

```python
#!/usr/bin/env python3
"""
Adversarial Robustness Testing Skill
Tests ML models against common adversarial attacks
"""
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import torch
import torch.nn as nn


@dataclass
class RobustnessConfig:
    """Configuration for robustness testing"""
    epsilon: float = 0.03
    num_iterations: int = 10
    step_size: float = 0.003
    attack_type: str = "fgsm"
    target_class: Optional[int] = None


class AdversarialTester:
    """Core adversarial testing implementation"""

    def __init__(self, model: nn.Module, config: RobustnessConfig):
        self.model = model
        self.config = config
        self.model.eval()

    def fgsm_attack(self, image: torch.Tensor, epsilon: float,
                    gradient: torch.Tensor) -> torch.Tensor:
        """Fast Gradient Sign Method attack"""
        perturbed = image + epsilon * gradient.sign()
        return torch.clamp(perturbed, 0, 1)

    def pgd_attack(self, image: torch.Tensor, label: int, epsilon: float,
                   iterations: int, step_size: float) -> torch.Tensor:
        """Projected Gradient Descent attack"""
        perturbed = image.clone().detach()

        for _ in range(iterations):
            perturbed.requires_grad = True
            output = self.model(perturbed.unsqueeze(0))
            self.model.zero_grad()

            if self.config.target_class is not None:
                # Targeted attack: descend toward the target class
                loss = -nn.functional.cross_entropy(
                    output, torch.tensor([self.config.target_class]))
            else:
                # Untargeted attack: ascend the loss on the true label
                loss = nn.functional.cross_entropy(output, torch.tensor([label]))

            loss.backward()
            perturbed = perturbed + step_size * perturbed.grad.sign()

            # Project back to the epsilon ball and detach for the next iteration
            delta = torch.clamp(perturbed - image, -epsilon, epsilon)
            perturbed = torch.clamp(image + delta, 0, 1).detach()

        return perturbed

    def test_robustness(self, test_data: List[Tuple]) -> Dict:
        """Run comprehensive robustness tests"""
        results = {
            "original_accuracy": 0,
            "adversarial_accuracy": 0,
            "perturbations_generated": 0,
            "vulnerable_samples": []
        }

        for idx, (image, label) in enumerate(test_data):
            image.requires_grad = True
            output = self.model(image.unsqueeze(0))
            original_pred = output.argmax(dim=1).item()

            # Skip samples the model already misclassifies on clean input
            if original_pred != label:
                continue

            results["original_accuracy"] += 1

            # Generate adversarial example
            if self.config.attack_type == "fgsm":
                self.model.zero_grad()
                # Assumes the model returns raw logits
                loss = nn.functional.cross_entropy(output, torch.tensor([label]))
                loss.backward()
                adv_image = self.fgsm_attack(image, self.config.epsilon,
                                             image.grad)
            else:
                adv_image = self.pgd_attack(image, label, self.config.epsilon,
                                            self.config.num_iterations,
                                            self.config.step_size)

            # Test adversarial example
            adv_output = self.model(adv_image.unsqueeze(0))
            adv_pred = adv_output.argmax(dim=1).item()

            results["perturbations_generated"] += 1

            if adv_pred != original_pred:
                results["vulnerable_samples"].append({
                    "index": idx,
                    "original_prediction": original_pred,
                    "adversarial_prediction": adv_pred,
                    "confidence_original": torch.softmax(output, dim=1)[0][original_pred].item(),
                    "confidence_adversarial": torch.softmax(adv_output, dim=1)[0][adv_pred].item()
                })

        # Adversarial accuracy counts only samples that stay correct under attack
        results["adversarial_accuracy"] = (
            results["original_accuracy"] - len(results["vulnerable_samples"])
        ) / len(test_data)
        results["original_accuracy"] /= len(test_data)

        return results
```
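
As a quick sanity check, the tester can be run directly against a list of (image, label) tensors. The sketch below assumes the class lives in a module named `adversarial_tester`, matching the import used in the CI script later; the model and data paths are placeholders.

```python
# Minimal usage sketch: paths and shapes are illustrative placeholders
import torch
from adversarial_tester import AdversarialTester, RobustnessConfig

model = torch.jit.load("models/classifier.pt")      # hypothetical TorchScript model
test_data = torch.load("data/test_samples.pt")      # list of (image, label) pairs

config = RobustnessConfig(epsilon=0.03, attack_type="pgd", num_iterations=10)
tester = AdversarialTester(model, config)

results = tester.test_robustness(test_data)
print(f"Clean accuracy:       {results['original_accuracy']:.2%}")
print(f"Adversarial accuracy: {results['adversarial_accuracy']:.2%}")
print(f"Vulnerable samples:   {len(results['vulnerable_samples'])}")
```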

This implementation provides the foundation for testing models against Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks, the two most common adversarial attack methodologies.

Automating Continuous Robustness Testing

Integrate adversarial testing into your CI/CD pipeline to catch robustness regressions before they reach production. Create a Claude Code skill that runs automatically:

```bash
#!/bin/bash
# run-robustness-tests.sh
set -euo pipefail

MODEL_PATH="$1"
TEST_DATA_PATH="$2"
REPORT_OUTPUT="$3"

echo "Running adversarial robustness tests..."
echo "Model: $MODEL_PATH"
echo "Test data: $TEST_DATA_PATH"

python3 -c "
import sys
import torch
from adversarial_tester import AdversarialTester, RobustnessConfig

# Load model
model = torch.jit.load('$MODEL_PATH')
config = RobustnessConfig(epsilon=0.03, attack_type='pgd')
tester = AdversarialTester(model, config)

# Run tests
test_data = torch.load('$TEST_DATA_PATH')
results = tester.test_robustness(test_data)

# Generate report
print(f'Original Accuracy: {results[\"original_accuracy\"]:.2%}')
print(f'Adversarial Accuracy: {results[\"adversarial_accuracy\"]:.2%}')
print(f'Vulnerable Samples: {len(results[\"vulnerable_samples\"])}')

# Exit with an error if robustness falls below the threshold
if results['adversarial_accuracy'] < 0.7:
    sys.exit(1)
"

echo "Robustness tests completed successfully"
```

Schedule this to run nightly or on every model update. Set clear thresholds for acceptable robustness, typically 70% or higher adversarial accuracy for production systems.

Implementing Defensive Strategies

Once you identify vulnerabilities, implement defensive measures. Common strategies include adversarial training, input preprocessing, and ensemble methods:

```python
def adversarial_training(model: nn.Module,
                         train_data: List,
                         config: RobustnessConfig,
                         epochs: int = 10):
    """
    Adversarial training combines standard training with
    adversarial examples to improve robustness
    """
    # Build the tester first: its constructor switches the model to eval mode,
    # so move the model back into training mode afterwards
    tester = AdversarialTester(model, config)
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        total_loss = 0

        for images, labels in train_data:
            # Generate adversarial examples from the current model state
            images.requires_grad = True
            outputs = model(images)
            loss = nn.functional.cross_entropy(outputs, labels)

            model.zero_grad()
            loss.backward()

            # Adversarial training step: craft FGSM examples and detach them
            # so the attack graph does not leak into the training update
            adv_images = tester.fgsm_attack(images, config.epsilon,
                                            images.grad).detach()

            # Combined training on clean and adversarial batches
            optimizer.zero_grad()
            standard_output = model(images)
            adv_output = model(adv_images)

            combined_loss = (
                nn.functional.cross_entropy(standard_output, labels) +
                nn.functional.cross_entropy(adv_output, labels)
            )

            combined_loss.backward()
            optimizer.step()
            total_loss += combined_loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_data):.4f}")

    return model
```

Adversarial training significantly improves robustness but typically reduces clean accuracy by a few percentage points, with the exact cost depending on the dataset and the perturbation budget. This trade-off is often acceptable for security-critical applications.

Measuring and Reporting Robustness Metrics

Create comprehensive reports that track robustness over time:

```python
from datetime import datetime

def generate_robustness_report(results: Dict, model_version: str) -> str:
    """Generate markdown robustness report"""

    report = f"""# Adversarial Robustness Report

## Model Information
- Version: {model_version}
- Date: {datetime.now().isoformat()}

## Test Results
- Original Accuracy: {results['original_accuracy']:.2%}
- Adversarial Accuracy: {results['adversarial_accuracy']:.2%}
- Robustness Gap: {results['original_accuracy'] - results['adversarial_accuracy']:.2%}

## Vulnerability Analysis
- Total vulnerable samples: {len(results['vulnerable_samples'])}
"""

    if results['vulnerable_samples']:
        report += "### Most Vulnerable Cases\n\n"
        for sample in results['vulnerable_samples'][:5]:
            report += (f"- Sample {sample['index']}: "
                       f"{sample['original_prediction']} → {sample['adversarial_prediction']}\n")

    return report
```

Track these metrics in a dashboard to visualize robustness trends across model versions. Sudden drops in adversarial accuracy indicate potential issues requiring immediate investigation.

Building the Complete Workflow

Tie everything together with a comprehensive skill that orchestrates the full adversarial robustness pipeline:

```
# Invoke the adversarial robustness skill
/robustness-test --model ./models/production-classifier.pt \
    --data ./data/test/adversarial_benchmark.pt \
    --output ./reports/robustness-v1.2.0.md \
    --threshold 0.75 \
    --attacks fgsm,pgd
```

This workflow handles model loading, test execution, metric calculation, and report generation. Integrate it with your model registry to automatically test every new model version before deployment.

Building solid ML systems requires proactive security testing. Claude Code makes adversarial robustness evaluation accessible through automation, comprehensive tooling, and continuous monitoring. Start with basic FGSM testing and progressively add more sophisticated attacks and defenses as your workflow matures.

Step-by-Step Guide: Building a Continuous Robustness Testing Pipeline

Here is a concrete approach to integrating adversarial robustness testing into your ML deployment workflow.

Step 1. Establish your robustness baseline. Before implementing defenses, measure your current model’s robustness by running FGSM attacks at multiple epsilon values and plotting the accuracy curve. This baseline tells you how much robustness you need to improve and gives you a metric to track over time. Claude Code generates the baseline measurement script and the matplotlib visualization.
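
A minimal baseline sweep might look like the following sketch. It reuses the AdversarialTester class from earlier and treats the model and data paths as placeholders; the epsilon grid is an assumption to adapt to your input scale.

```python
# Baseline robustness sweep: adversarial accuracy vs. epsilon (paths are placeholders)
import matplotlib.pyplot as plt
import torch
from adversarial_tester import AdversarialTester, RobustnessConfig

model = torch.jit.load("models/classifier.pt")
test_data = torch.load("data/test_samples.pt")

epsilons = [0.0, 0.01, 0.02, 0.03, 0.05, 0.1]
accuracies = []

for eps in epsilons:
    config = RobustnessConfig(epsilon=eps, attack_type="fgsm")
    tester = AdversarialTester(model, config)
    results = tester.test_robustness(test_data)
    # At epsilon 0 the adversarial accuracy equals clean accuracy
    accuracies.append(results["adversarial_accuracy"])

plt.plot(epsilons, accuracies, marker="o")
plt.xlabel("Epsilon (L-inf perturbation budget)")
plt.ylabel("Adversarial accuracy")
plt.title("FGSM robustness baseline")
plt.savefig("reports/robustness_baseline.png")
```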

Step 2. Integrate attacks into your evaluation script. Extend your existing model evaluation script to include adversarial examples alongside clean test data. The evaluation should report both clean accuracy and adversarial accuracy at your chosen epsilon value. Claude Code generates the evaluation extension that adds adversarial metrics without requiring changes to your existing model code.

Step 3. Add robustness tests to your CI pipeline. Add a step that runs the robustness evaluation script after every model training run. Claude Code generates the GitHub Actions job that loads the latest checkpoint, runs adversarial evaluation, and fails the CI job if adversarial accuracy drops below your threshold.

Step 4. Implement adversarial training as an experiment. Create a training variant that mixes clean and adversarial examples in each batch. Track both clean and adversarial accuracy in your experiment tracking system. Claude Code generates the modified training loop and the MLflow logging configuration that records both metrics across epochs.
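
A sketch of the per-epoch logging is below. It assumes an MLflow tracking server is already configured, and `train_one_epoch`, `evaluate_clean`, and `evaluate_adversarial` are hypothetical helpers standing in for your own training and evaluation code; only the logging calls are the point here.

```python
# Log clean and adversarial accuracy per epoch to MLflow (sketch)
import mlflow

epochs = 10

with mlflow.start_run(run_name="adversarial-training"):
    mlflow.log_params({"epsilon": config.epsilon, "attack_type": config.attack_type})

    for epoch in range(epochs):
        train_one_epoch(model, train_data, tester, config)   # hypothetical helper
        clean_acc = evaluate_clean(model, val_data)           # hypothetical helper
        adv_acc = evaluate_adversarial(model, val_data, tester)

        # Recording both metrics makes the robustness gap visible across epochs
        mlflow.log_metric("clean_accuracy", clean_acc, step=epoch)
        mlflow.log_metric("adversarial_accuracy", adv_acc, step=epoch)
```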

Step 5. Deploy robustness monitoring for production models. Add input validation at inference time that detects statistical anomalies consistent with adversarial examples. Claude Code generates the detection wrapper that computes feature statistics for each input, compares them to training distribution statistics, and flags anomalous inputs for review.
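
One simple version of this check is sketched below: it compares each input's mean intensity to statistics computed offline over the training set and flags inputs whose z-score exceeds a threshold. The statistics and threshold values are assumptions you would tune for your data; production detectors typically use richer feature statistics.

```python
# Lightweight input anomaly check against training-set statistics (sketch)
import torch

class InputAnomalyDetector:
    def __init__(self, train_mean: float, train_std: float, z_threshold: float = 4.0):
        # train_mean / train_std: mean and std of per-image mean intensities,
        # computed offline from the training set (placeholder values below)
        self.train_mean = train_mean
        self.train_std = train_std
        self.z_threshold = z_threshold

    def is_anomalous(self, x: torch.Tensor) -> bool:
        # Flag inputs whose mean intensity is far from the training distribution
        z = abs(x.mean().item() - self.train_mean) / (self.train_std + 1e-8)
        return z > self.z_threshold

detector = InputAnomalyDetector(train_mean=0.45, train_std=0.05)
sample = torch.rand(3, 224, 224)
if detector.is_anomalous(sample):
    print("Input flagged for review")
```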

Common Pitfalls

Testing only with FGSM. FGSM is the simplest attack, and a model that resists it may still be vulnerable to stronger attacks such as PGD or AutoAttack. A model that claims robustness based on FGSM alone is likely overestimating its security. Claude Code generates an evaluation suite that includes multiple attack types at standardized epsilon values.

Using the wrong epsilon scale. The choice of epsilon depends on your input normalization. An epsilon of 0.03 is meaningful for pixels normalized to [0, 1] but nonsensical for other normalization schemes. Claude Code generates epsilon conversion utilities that translate between common normalization schemes.
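
For example, if your pipeline applies per-channel mean/std normalization, a pixel-space epsilon must be divided by the std to stay equivalent in normalized space. The ImageNet-style std values below are an illustrative assumption.

```python
# Convert an epsilon defined on [0, 1] pixel values into normalized-input units
import torch

def pixel_eps_to_normalized(eps_pixel: float, std: torch.Tensor) -> torch.Tensor:
    # A perturbation of eps_pixel in [0, 1] space corresponds to eps_pixel / std
    # per channel after (x - mean) / std normalization
    return eps_pixel / std

imagenet_std = torch.tensor([0.229, 0.224, 0.225])
print(pixel_eps_to_normalized(0.03, imagenet_std))  # roughly [0.131, 0.134, 0.133] per channel
```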

Not separating adversarial training from test evaluation. The adversarial examples generated during training should not double as your test set; evaluate with fresh attacks generated against the final, defended model. Claude Code generates an evaluation protocol that runs a fresh attack instance against the defended model.

Evaluating robustness only on aggregate accuracy. Some classes are much more vulnerable than others, and aggregate accuracy hides this. A model that is 95% accurate overall but 30% accurate on one safety-critical class is not robust. Claude Code generates the per-class robustness breakdown.
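
A per-class breakdown can be computed from the labels and adversarial predictions collected during evaluation. The sketch below assumes you record the true label and the prediction under attack for every test sample.

```python
# Per-class adversarial accuracy from recorded predictions (sketch)
from collections import defaultdict
from typing import Dict, List

def per_class_adversarial_accuracy(labels: List[int],
                                   adv_preds: List[int]) -> Dict[int, float]:
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, adv_pred in zip(labels, adv_preds):
        total[label] += 1
        if adv_pred == label:
            correct[label] += 1
    return {cls: correct[cls] / total[cls] for cls in total}

# Example: class 2 is far weaker than the aggregate number suggests
print(per_class_adversarial_accuracy([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 0, 1]))
# {0: 1.0, 1: 1.0, 2: 0.0}
```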

Treating adversarial robustness as a one-time task. Model robustness degrades when you retrain on new data or fine-tune for a different task. Claude Code generates the robustness regression test suite that runs on every model update.

Best Practices

Use certified defenses for high-assurance applications. Empirical defenses like adversarial training can be broken by adaptive attacks. For security-critical applications, use certified defenses based on randomized smoothing which provide provable guarantees. Claude Code generates the randomized smoothing wrapper and the certification script.

Report robustness results using standardized protocols. When sharing robustness results, use AutoAttack as the evaluation method. Results with FGSM are easy to overfit to and hard to compare across implementations. Claude Code generates the AutoAttack evaluation script and a result card template.

Maintain adversarial example datasets for regression testing. Save a fixed set of adversarial examples generated from your baseline evaluation. Use this fixed set to check for regressions after every model update. Claude Code generates the adversarial example archiving script and the regression comparison report.
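
A minimal archive-and-compare flow, assuming the adversarial images and labels are kept as tensors; the file path and directory layout are placeholders.

```python
# Archive a fixed adversarial set, then check a new model against it (sketch)
import torch

def archive_adversarial_set(adv_images: torch.Tensor, labels: torch.Tensor,
                            path: str = "artifacts/adv_regression_set.pt") -> None:
    torch.save({"images": adv_images, "labels": labels}, path)

def regression_accuracy(model: torch.nn.Module,
                        path: str = "artifacts/adv_regression_set.pt") -> float:
    # Accuracy on the frozen adversarial set; a drop after retraining is a regression
    data = torch.load(path)
    with torch.no_grad():
        preds = model(data["images"]).argmax(dim=1)
    return (preds == data["labels"]).float().mean().item()
```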

Document your threat model explicitly. Robustness is meaningful only relative to a threat model: what epsilon constraint, what attack budget, what attack algorithm? Document these parameters explicitly in your model card. Claude Code generates the threat model documentation template.

Integration Patterns

Weights and Biases integration. Claude Code generates the W&B callback that logs adversarial accuracy, robustness gap, and vulnerability distribution to your existing W&B runs alongside your standard training metrics. Robustness trends across experiments are visible in the same dashboard as your clean accuracy curves.
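
The callback itself reduces to a few `wandb.log` calls. The sketch below assumes a run has already been initialized with `wandb.init` and reuses the results dictionary from the tester; the metric names are illustrative.

```python
# Log robustness metrics alongside standard training metrics (sketch)
import wandb

def log_robustness_metrics(results: dict, step: int) -> None:
    wandb.log({
        "clean_accuracy": results["original_accuracy"],
        "adversarial_accuracy": results["adversarial_accuracy"],
        "robustness_gap": results["original_accuracy"] - results["adversarial_accuracy"],
        "vulnerable_samples": len(results["vulnerable_samples"]),
    }, step=step)
```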

Model registry integration. Claude Code generates the pre-registration robustness check that gates model promotion to production. Only models that pass the robustness threshold are eligible for promotion. The check result is logged as a model registry tag for auditability.

Red team automation. Claude Code generates a scheduled adversarial red team job that runs weekly against your production model endpoints using the latest attack algorithms, simulating the behavior of an attacker who periodically tries new approaches to defeat your defenses.

Input Preprocessing Defenses

Input preprocessing applied before inputs reach the model can neutralize many adversarial perturbations. Feature squeezing reduces color depth or applies spatial smoothing to remove high-frequency perturbations. JPEG compression removes the subtle pixel-level changes that adversarial attacks rely on. These preprocessing steps are fast, require no model retraining, and can be applied at inference time with minimal latency impact.

The trade-off is that preprocessing also changes benign inputs, which can slightly reduce clean accuracy. Claude Code generates the preprocessing pipeline with configurable strength parameters and a measurement script that quantifies the clean accuracy cost versus the adversarial robustness gain for each preprocessing step, helping you choose the right balance for your application.
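
A small sketch of two squeezing steps, bit-depth reduction and local smoothing, implemented with plain PyTorch. The strength parameters are assumptions to tune per application, and mean smoothing stands in here for the median filtering used in the original feature-squeezing work.

```python
# Feature-squeezing style preprocessing: bit-depth reduction + local smoothing (sketch)
import torch
import torch.nn.functional as F

def reduce_bit_depth(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Quantize [0, 1] pixels to 2**bits levels to strip fine-grained perturbations
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def smooth(x: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    # Simple mean smoothing as a stand-in for median filtering
    pad = kernel_size // 2
    return F.avg_pool2d(x, kernel_size, stride=1, padding=pad)

image = torch.rand(1, 3, 224, 224)
defended_input = smooth(reduce_bit_depth(image))
```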

Input certification through randomized smoothing adds Gaussian noise to inputs multiple times, runs inference on each noisy version, and uses the majority vote as the prediction. Certified robustness means you can mathematically prove that no perturbation within a certain radius can change the prediction. Claude Code generates the smoothed classifier wrapper and the certification procedure using the CDF-based radius calculation from the original randomized smoothing paper.
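
A simplified version of the prediction-plus-certification step is sketched below. It follows the randomized smoothing construction but uses a plain empirical estimate of the top-class probability rather than the rigorous lower confidence bound from the original procedure, so treat it as illustrative; sample count, noise level, and class count are assumptions.

```python
# Randomized smoothing sketch: noisy majority vote + approximate certified radius
import torch
from scipy.stats import norm

def smoothed_predict(model: torch.nn.Module, x: torch.Tensor,
                     sigma: float = 0.25, n_samples: int = 100,
                     num_classes: int = 10):
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)
            pred = model(noisy.unsqueeze(0)).argmax(dim=1).item()
            counts[pred] += 1

    top_class = int(counts.argmax().item())
    p_top = counts[top_class].item() / n_samples
    p_top = min(p_top, 1.0 - 1e-6)  # avoid an infinite radius in this sketch

    # CDF-based radius; a rigorous certificate would use a lower confidence bound
    radius = sigma * norm.ppf(p_top) if p_top > 0.5 else 0.0
    return top_class, radius
```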

