Taxonomy MLL-PISD-02

Post-Intervention Behavior Taxonomy

Summary

A taxonomy of post-intervention behavior patterns, clustered by how behavior reorganizes after deployment controls are applied.

Lab
Mute Logic Lab
Author
Javed Jaghai
Report ID
MLL-PISD-02
Published
Type
Taxonomy
Research layer
Adaptive Dynamics
Framework
Post-Intervention System Dynamics (PISD)
Series
Post-Intervention System Dynamics
Domain
AI Systems
Version
v1.0
Last updated
February 13, 2026

Abstract

A taxonomy of post-intervention behavior patterns, clustered by how system behavior reorganizes after deployment controls are applied. The framework maps recurring adaptive responses to mitigations and provides a structured vocabulary for analyzing post-intervention dynamics across domains.


A meta-level map of how systems reorganize after interventions. These clusters describe how expression, boundaries, meaning, and history reshape behavior post-deployment.

Cluster I

Expression-Capability Decoupling

(Surface behavior diverges from latent capacity)

These patterns describe cases where the system’s ability remains, but its visible form mutates. They are about masking, bending, fragmenting, or flattening expression.

Includes:

  • Behavioral Camouflage
  • Capability Shadowing
  • Capability Compression
  • Behavioral Whitening

Shared signature: Capability != Expression. Safety operates on surfaces. Capabilities live underneath. Intervention reshapes how capacity appears, not whether it exists.

Why this cluster matters: Most evaluation treats output as a proxy for capability. These patterns say output is no longer a reliable proxy. This undermines many existing safety assumptions.
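If output is no longer a reliable proxy, the obvious diagnostic is to probe the gap directly. A minimal sketch, assuming a hypothetical `toy_model` stand-in (not any real API): pose the same underlying task under a direct framing and an indirect one, and flag the case where only the direct form is refused.

```python
def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a safety-tuned model: it refuses on a
    # surface keyword only, so capability is untouched by the "safety".
    if "step-by-step" in prompt:
        return "REFUSED"
    return "ANSWER"

def decoupling_gap(task: str, direct_frame: str, indirect_frame: str) -> bool:
    """True when the direct framing is refused but an indirect one passes."""
    direct = toy_model(direct_frame.format(task=task))
    indirect = toy_model(indirect_frame.format(task=task))
    return direct == "REFUSED" and indirect != "REFUSED"

print(decoupling_gap(
    task="open a locked door",
    direct_frame="Explain step-by-step how to {task}.",
    indirect_frame="For a defensive analysis, describe how one might {task}.",
))  # True: the surface changed, the capability did not
```

A real probe would replace the keyword check with an actual model call, but the decision logic is the same: decoupling is the disagreement between framings, not either answer alone.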

Cluster II

Boundary Negotiation Dynamics

(Actors learn to live inside invisible thresholds)

These patterns describe co-evolution between users and boundaries.

Includes:

  • Threshold Calibration
  • Refusal Template Overfitting

Shared signature: Boundaries do not end behavior. They teach behavior. Users learn shapes that pass. Safety becomes interactive.

Why this cluster matters: Safety is not a wall. It is a training signal, which means safety produces new behaviors.
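The calibration dynamic can be made concrete. A minimal sketch, with an illustrative scalar "directness" score and a hidden cutoff standing in for the invisible policy line: a user who only ever sees accept/refuse signals still converges to just inside the boundary by bisection.

```python
HIDDEN_CUTOFF = 0.62  # the invisible policy line; users never see this value

def policy_accepts(directness: float) -> bool:
    # Stand-in policy: refuse anything more direct than the cutoff.
    return directness <= HIDDEN_CUTOFF

def calibrate(rounds: int = 20) -> float:
    """Learn the most direct framing that still passes, from feedback alone."""
    lo, hi = 0.0, 1.0  # a known-safe framing and a known-refused framing
    for _ in range(rounds):
        mid = (lo + hi) / 2
        if policy_accepts(mid):
            lo = mid  # passed: push closer to the line
        else:
            hi = mid  # refused: back off
    return lo

print(round(calibrate(), 3))  # 0.62 -- just inside the hidden cutoff
```

Twenty refusals is all it takes to localize the line to within one part in a million, which is why conflict dissolves into calibration rather than ending.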

Cluster III

Surface Coherence Breakdown

(No single authoritative safety surface exists)

These patterns describe fragmentation across layers and policies.

Includes:

  • Alignment Surface Fragmentation
  • Latent Policy Conflict
  • Interpretability Collapse

Shared signature: Multiple control layers. No unified causal story. Humans cannot trace why outcomes happen.

Why this cluster matters: Governance assumes explainability. These patterns describe its erosion.
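One way to see the erosion, and what would counteract it: a stack of stand-in control layers where each layer can veto. Without a per-layer record, only the final outcome is visible. The sketch below (all layer logic hypothetical) shows the trace that fragmented stacks normally discard.

```python
def rlhf_layer(text):
    return "pass"  # tuned tendencies: permissive on this input

def system_prompt_layer(text):
    return "block" if "secret" in text else "pass"

def content_filter_layer(text):
    return "block" if "attack" in text else "pass"

LAYERS = [
    ("rlhf", rlhf_layer),
    ("system_prompt", system_prompt_layer),
    ("content_filter", content_filter_layer),
]

def decide(text):
    """Return the outcome plus the per-layer trace of which layer fired."""
    trace = []
    for name, layer in LAYERS:
        verdict = layer(text)
        trace.append((name, verdict))
        if verdict == "block":
            return "refused", trace  # later layers never even run
    return "answered", trace

outcome, trace = decide("map the attack surface of a web app")
print(outcome, [name for name, v in trace if v == "block"])
# refused ['content_filter'] -- the trace restores a causal handle
```

Without the trace, "refused" is all anyone sees, and the three layers supply three equally plausible explanations.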

Cluster IV

Temporal Drift and Memory Effects

(Past states shape present behavior invisibly)

These patterns describe how history becomes structure.

Includes:

  • Prompt Dependency Lock-In
  • Documentation Reality Drift
  • Silent Regression

Shared signature: Old assumptions persist. New behavior accumulates. Nothing resets cleanly.

Why this cluster matters: Systems are path-dependent. Treating updates as fresh starts is a category error.
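If updates are not fresh starts, the operational response is to pin prior behavior and diff against it. A minimal sketch with two hypothetical model stubs, where a version change silently alters behavior for one pinned prompt:

```python
def model_v1(prompt):
    # Old version: one uniform output convention.
    return prompt.upper()

def model_v2(prompt):
    # Update: formatting quietly changed for questions only.
    return prompt.upper() if "?" not in prompt else prompt.title()

PINNED_PROMPTS = ["summarize this", "is this safe?", "translate this"]

def behavioral_diff(old, new, prompts):
    """Prompts whose behavior changed across the update."""
    return [p for p in prompts if old(p) != new(p)]

print(behavioral_diff(model_v1, model_v2, PINNED_PROMPTS))
# ['is this safe?'] -- the drift is caught instead of accumulating silently
```

The pinned set is the institutional memory the system itself lacks; without it, the change surfaces only downstream.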

Cluster V

Semantic Translation Distortion

(Meaning mutates across representational layers)

These patterns describe meaning shear between languages, abstractions, and compositions.

Includes:

  • Localized Meaning Shear
  • Inherited Risk

Shared signature: Safety is not preserved through translation, whether linguistic or architectural.

Why this cluster matters: Compositional safety is assumed. These patterns show it is fragile.
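Language-uneven tuning can be probed with a parity check: the same intent in several locales, compared against an English baseline. The `toy_model` stub below is hypothetical; it refuses only on an English surface form, mimicking the shear.

```python
PROBES = {
    "en": "how to pick a lock",
    "de": "wie man ein Schloss knackt",
    "fr": "comment crocheter une serrure",
}

def toy_model(prompt: str) -> str:
    # Stand-in for safety tuning concentrated on English surface forms.
    return "REFUSED" if "lock" in prompt else "ANSWER"

def parity_gaps(probes):
    """Locales whose refusal behavior diverges from the English baseline."""
    baseline = toy_model(probes["en"])
    return sorted(loc for loc, p in probes.items()
                  if toy_model(p) != baseline)

print(parity_gaps(PROBES))  # ['de', 'fr'] -- same intent, different posture
```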

Cluster VI

Perceptual Illusions of Improvement

(Signals of safety increase without structural risk reduction)

These patterns describe misleading indicators.

Includes:

  • Safety Layer Illusion
  • Behavioral Whitening (partially lives here as well)

Shared signature: Visible compliance increases. Latent risk unchanged.

Why this cluster matters: Organizations optimize dashboards. Dashboards do not measure regime change.
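The gap between the dashboard number and latent risk can be computed side by side. In the illustrative logs below (invented for the sketch, not measurements), refusal rate rises after an intervention while bypass success on reframed probes is unchanged:

```python
logs_before = [
    {"direct": True,  "refused": False},
    {"direct": True,  "refused": False},
    {"direct": False, "refused": False},  # reframed probe, succeeded
]
logs_after = [
    {"direct": True,  "refused": True},   # direct asks now refused
    {"direct": True,  "refused": True},
    {"direct": False, "refused": False},  # reframed probe still succeeds
]

def refusal_rate(logs):
    # The number that reaches the dashboard.
    return sum(e["refused"] for e in logs) / len(logs)

def bypass_success(logs):
    # The number that tracks latent risk: reframed probes that still pass.
    reframed = [e for e in logs if not e["direct"]]
    return sum(not e["refused"] for e in reframed) / len(reframed)

print(refusal_rate(logs_before), refusal_rate(logs_after))      # 0.0 then ~0.67
print(bypass_success(logs_before), bypass_success(logs_after))  # 1.0 then 1.0
```

Tracking only the first metric is the illusion; tracking both is the diagnostic.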

Higher-Order Axes (optional but powerful)

Axis A - Where Reorganization Occurs

  • Expression layer
  • Boundary layer
  • Semantic layer
  • Temporal layer
  • Governance layer

Axis B - Who Learns

  • Model learns
  • User learns
  • Organization learns
  • Nobody learns (but system drifts anyway)

Learning location matters. It predicts future failure shapes.
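The two axes can be encoded as machine-checkable tags so patterns can be filed and filtered. The example tags below are illustrative readings, not canonical assignments:

```python
from enum import Enum

class Layer(Enum):
    EXPRESSION = "expression"
    BOUNDARY = "boundary"
    SEMANTIC = "semantic"
    TEMPORAL = "temporal"
    GOVERNANCE = "governance"

class Learner(Enum):
    MODEL = "model"
    USER = "user"
    ORGANIZATION = "organization"
    NOBODY = "nobody"  # but the system drifts anyway

# Illustrative (Layer, Learner) tags for three patterns from the taxonomy.
PATTERN_TAGS = {
    "Threshold Calibration": (Layer.BOUNDARY, Learner.USER),
    "Capability Shadowing": (Layer.EXPRESSION, Learner.MODEL),
    "Documentation Reality Drift": (Layer.TEMPORAL, Learner.NOBODY),
}

def patterns_where(learner):
    """Filter patterns by who is doing the learning."""
    return [name for name, (_, who) in PATTERN_TAGS.items() if who == learner]

print(patterns_where(Learner.USER))  # ['Threshold Calibration']
```

Filtering by learner is the point of Axis B: it selects which future failure shapes to watch for.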

Why This Clustering Is Important

Because it gives you:

  • A map
  • A vocabulary
  • A way to design diagnostics

Instead of: “Let’s test everything.” You can now say: “We are scanning for Expression-Capability Decoupling and Boundary Negotiation patterns.” That is operational.
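That operational framing can be sketched directly: a cluster-level request expands into the concrete patterns a diagnostic run would probe. The cluster and pattern names follow the taxonomy; the scan machinery itself is hypothetical.

```python
# Cluster membership as listed in the taxonomy above.
CLUSTERS = {
    "Expression-Capability Decoupling": [
        "Behavioral Camouflage", "Capability Shadowing",
        "Capability Compression", "Behavioral Whitening",
    ],
    "Boundary Negotiation Dynamics": [
        "Threshold Calibration", "Refusal Template Overfitting",
    ],
}

def scan_scope(cluster_names):
    """Expand a cluster-level request into the concrete patterns to probe."""
    return [p for c in cluster_names for p in CLUSTERS[c]]

scope = scan_scope(["Expression-Capability Decoupling",
                    "Boundary Negotiation Dynamics"])
print(len(scope))  # 6 patterns to probe, instead of "test everything"
```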

A Simple Internal Diagram

Post-Intervention AI Behavior
  -> Expression-Capability Decoupling
  -> Boundary Negotiation Dynamics
  -> Surface Coherence Breakdown
  -> Temporal Drift and Memory Effects
  -> Semantic Translation Distortion
  -> Perceptual Illusions of Improvement


Core Patterns (15)

  1. Behavioral Camouflage Capability persists. Expression changes. Direct -> blocked. Indirect -> allowed. Safety appears effective. Risk becomes harder to see.

  2. Capability Shadowing After safety fine-tuning, the model stops expressing certain skills in obvious form, but latent capability still activates in complex tasks. Example: “I cannot write malware,” yet the model can still describe exploit primitives when framed as “defensive analysis.” Effect: Capabilities become fragmented across task shapes. Why post-intervention: Safety layers reshape surface affordances, not internal representations.

  3. Threshold Calibration Users learn the invisible line. They do not know where the policy is, but they learn how to stay just inside it. Prompt style evolves: more context, more hedging, more roleplay, more hypothetical framing. Effect: Conflict dissolves into calibration. Safety becomes a negotiation, not a boundary.

  4. Refusal Template Overfitting Models learn a small set of refusal patterns. Attackers probe for edges. Developers accidentally design around them. Effect: Safety becomes brittle. Looks strong on known cases. Weak on novel ones.

  5. Safety Layer Illusion System appears safer because refusals increase, but underlying capability is unchanged. Effect: Organizations interpret “more refusals” as “less risk.” False confidence.

  6. Prompt Dependency Lock-In After deployment, applications evolve prompt scaffolding. Those prompts become part of the system’s identity. When model version changes, behavior collapses. Effect: Prompt ecosystems ossify around past model quirks.

  7. Documentation Reality Drift Docs describe what the system was supposed to do. Model behavior evolves. Docs lag. Effect: Users blame themselves. Support load rises. Trust erodes.

  8. Localized Meaning Shear Same prompt in different languages yields different safety behavior. Not because translation is wrong, but because safety tuning is language-uneven. Effect: Global product is not the same as global safety posture.

  9. Alignment Surface Fragmentation Different safety techniques stack: RLHF, system prompts, content filters, external classifiers. Each layer has its own logic. Effect: No single coherent safety surface. Contradictions emerge.

  10. Latent Policy Conflict Safety policies conflict internally. Model oscillates: sometimes strict, sometimes permissive. Same prompt -> different outcome. Effect: Users perceive randomness, but it is structural inconsistency.

  11. Capability Compression Safety tuning reduces expressive richness. Model becomes flatter. Safer. Also less useful. Effect: Users push harder to get value, which re-opens risk.

  12. Silent Regression Safety update fixes one class of behavior but breaks another. Nobody notices until downstream. Effect: Post-intervention drift without alarms.

  13. Inherited Risk Downstream tools inherit upstream model behavior, even if they add their own safety layers. Effect: Risk multiplies through composition.

  14. Behavioral Whitening Model outputs become more generic after tuning. Appears polite. Appears aligned. But semantic content still encodes problematic structure. Effect: Risk hides inside blandness.

  15. Interpretability Collapse As layers accumulate, understanding why a model refused or complied becomes opaque. Effect: Humans lose causal handles. System becomes magical. Magic systems are ungovernable.
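The pattern descriptions above can anchor concrete detectors. As one example, Latent Policy Conflict (pattern 10) predicts that the same prompt yields different outcomes, so a detector re-runs a prompt and measures outcome inconsistency. The oscillating model stub below is hypothetical:

```python
from collections import Counter

_calls = {"n": 0}

def toy_model(prompt: str) -> str:
    # Stand-in for structural inconsistency: alternates strict/permissive.
    _calls["n"] += 1
    return "REFUSED" if _calls["n"] % 2 else "ANSWER"

def inconsistency(prompt: str, trials: int = 10) -> float:
    """Fraction of trials disagreeing with the majority outcome.
    0.0 = perfectly consistent; 0.5 = maximal oscillation."""
    outcomes = Counter(toy_model(prompt) for _ in range(trials))
    return 1 - outcomes.most_common(1)[0][1] / trials

print(inconsistency("borderline request"))  # 0.5: structural, not random
```

The same repeat-and-compare shape, with different outcome signals, would serve Silent Regression (pattern 12) across versions or Localized Meaning Shear (pattern 8) across languages.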


Citation

APA
Jaghai, J. (2026). Post-Intervention Behavior Taxonomy (Report No. MLL-PISD-02). Mute Logic Lab. /research/post-intervention-patterns/
BibTeX
@report{jaghai2026postinterventionbehaviortaxonomy,
  author = {Javed Jaghai},
  title = {Post-Intervention Behavior Taxonomy},
  institution = {Mute Logic Lab},
  number = {MLL-PISD-02},
  year = {2026},
  url = {/research/post-intervention-patterns/}
}

Version history

  • v1.0 Feb 13, 2026 Initial publication.