Preprint 2025

Causal-Adapter: Taming Text-to-Image Diffusion
for Faithful Counterfactual Generation

A modular framework that adapts frozen T2I diffusion models for faithful, causally-grounded counterfactual image synthesis.

Lei Tong*1, Zhihua Liu*2, Chaochao Lu3, Dino Oglic1, Tom Diethe1, Philip Teare1, Sotirios A. Tsaftaris2, Chen Jin1
1 CAI, AstraZeneca, Cambridge    2 CHAI Hub, University of Edinburgh    3 Shanghai AI Lab
* Equal Contribution
Counterfactual generation illustration

Counterfactual generation is specified by two standard inputs: a predefined causal graph and semantic attributes that define the intervention.

Causal-Adapter instantiates a structural causal model (SCM) over the conditioning variables and tames an off-the-shelf text-to-image diffusion backbone to disentangle causal factors in the text-embedding space, generating faithful counterfactual images.

Key Highlights

"An off-the-shelf T2I diffusion model can be tamed with causal semantic attributes to generate faithful counterfactual images."

Significant Gains

Up to 50% attribute-MAE reduction (intervention effectiveness) and 87% FID reduction (image quality) over prior methods.

Low Compute

Fine-tunes in 10 hours on a single NVIDIA A10G (24 GB).

Model-Agnostic

Works with SD 1.5, SD 3, FLUX.1, and future T2I backbones.

Precise Attention

Better semantic-spatial alignment in diffusion latents via attention guidance.

Causal Graph Support

Supports learning causal structure (graph) from scratch when none is provided.

Open-Source

Code, data, and fine-tuned models will be publicly available.

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion models for counterfactual generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without an explicit causal mechanism, Causal-Adapter leverages structural causal modeling augmented with two attribute-regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss, which disentangles attribute factors and reduces spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to a 91% MAE reduction on Pendulum for accurate attribute control and an 87% FID reduction on ADNI for high-fidelity MRI generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.

Motivation & Comparison

Motivation study
Problem A

Continuous attributes are ignored

Current T2I models treat attributes as binary switches. Fine-grained, continuous control (e.g., ventricle volume) is lost in embedding space.

Problem B

Attribute entanglement

Editing one attribute (age) inadvertently changes others (beard, hairstyle) because T2I latents lack causal disentanglement.

Our Fix

Regularized cross-attention

Causal-Adapter's PAI + CTC loss disentangle token embeddings, yielding clean, localized attention maps per attribute.

Method comparison sketch
(a-c)

Prior generative approaches

VAE/GAN (a) produce low-fidelity outputs. Diffusion SCM (b) and autoencoders (c) disentangle only in auxiliary encoders, leaving diffusion latents entangled.

(d)

T2I-based editing

Heavy prompt engineering without explicit causal mechanisms. Attention maps misguide edits.

(e-f)

Causal-Adapter (Ours)

Injects causal attributes into learnable token embeddings with contrastive optimization, achieving both disentanglement and high fidelity.

Method

Causal-Adapter operates in three stages, plugging into a frozen T2I diffusion backbone.

Causal-Adapter framework overview
1

Causal Mechanism Modeling

Given a causal graph G and semantic attributes Y, each causal mechanism fi is modeled via a nonlinear MLP with additive noise. The SCM propagates interventions to all downstream variables.
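Stage 1 can be sketched in a few lines of PyTorch. This is a minimal illustration under our own simplifying assumptions: each mechanism is a small MLP over a node's parents with additive exogenous noise, and the toy pendulum-style graph, node names, and layer sizes are ours, not the paper's exact architecture.

```python
# Sketch of an SCM with MLP mechanisms and additive noise (illustrative only).
import torch
import torch.nn as nn

class SCM(nn.Module):
    def __init__(self, graph, hidden=32):
        # graph: dict node -> list of parents, given in topological order
        # (every parent appears before its children).
        super().__init__()
        self.graph = graph
        self.mechanisms = nn.ModuleDict({
            node: nn.Sequential(
                nn.Linear(max(len(parents), 1), hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for node, parents in graph.items()
        })

    def forward(self, noise, interventions=None):
        # noise: dict node -> (B, 1) exogenous noise.
        # interventions: dict node -> (B, 1) values that replace the
        # mechanism entirely (the do-operator).
        interventions = interventions or {}
        values = {}
        for node, parents in self.graph.items():
            if node in interventions:
                values[node] = interventions[node]
                continue
            if parents:
                pa = torch.cat([values[p] for p in parents], dim=-1)
            else:
                pa = torch.zeros_like(noise[node])
            # Additive-noise mechanism: y_i = f_i(pa_i) + u_i
            values[node] = self.mechanisms[node](pa) + noise[node]
        return values

# Toy pendulum-style graph: angle and light are roots; shadow depends on both.
graph = {"angle": [], "light": [], "shadow_len": ["angle", "light"]}
scm = SCM(graph)
u = {k: torch.randn(4, 1) for k in graph}
obs = scm(u)                                                    # observational
cf = scm(u, interventions={"angle": torch.full((4, 1), 0.5)})   # do(angle=0.5)
```

Because the noise `u` is shared between the two passes, `cf` is the counterfactual of `obs`: the intervened node is clamped, unaffected roots keep their values, and the edit propagates only to downstream variables.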

2

Prompt-Aligned Injection (PAI)

Causal attributes are injected into learnable token embeddings that replace placeholder tokens in the prompt. This aligns causal semantics with spatial features in cross-attention.
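The injection step can be sketched as follows. This is our own minimal reconstruction: a learnable per-attribute projector maps a (post-SCM) attribute value to a token embedding that overwrites a reserved placeholder slot in the prompt-embedding sequence; the slot indices, `embed_dim=8`, and projector sizes are illustrative assumptions.

```python
# Sketch of prompt-aligned injection (shapes and positions are illustrative).
import torch
import torch.nn as nn

embed_dim = 8
attr_names = ["age", "gender"]
placeholder_pos = {"age": 2, "gender": 4}  # reserved token slots in the prompt

# One learnable projector per attribute: scalar value -> token embedding.
projectors = nn.ModuleDict({
    a: nn.Sequential(nn.Linear(1, 16), nn.SiLU(), nn.Linear(16, embed_dim))
    for a in attr_names
})

def inject(prompt_emb, attrs):
    # prompt_emb: (B, T, D) frozen text-encoder output with placeholder tokens.
    # attrs: dict attr -> (B, 1) values (post-SCM, so causally consistent).
    out = prompt_emb.clone()
    for name, value in attrs.items():
        out[:, placeholder_pos[name], :] = projectors[name](value)
    return out  # fed to cross-attention in place of the original embeddings

prompt_emb = torch.randn(2, 6, embed_dim)
attrs = {"age": torch.tensor([[0.3], [0.8]]), "gender": torch.zeros(2, 1)}
cond = inject(prompt_emb, attrs)
```

Only the placeholder positions change; every other token embedding is untouched, which is what ties each attribute to a specific location in cross-attention.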

3

Conditioned Token Contrastive Loss

An InfoNCE-based loss pulls same-attribute tokens together and pushes different-attribute tokens apart, enforcing disentanglement and reducing spurious correlations.

At inference, abduction-action-prediction is performed via DDIM inversion. Optional Attention Guidance (AG) localizes edits to intervened tokens while preserving identity.
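The inference loop above can be sketched compactly. This is a stand-in illustration, not the released pipeline: DDIM inversion recovers the latent noise ("abduction"), the SCM applies the intervention outside this function so that `cond_cf` already carries the counterfactual attributes ("action"), and a forward DDIM pass with the counterfactual condition regenerates the image ("prediction"). `eps_model` here is a dummy for the frozen diffusion UNet, and the schedule is a toy one.

```python
# Sketch of abduction-action-prediction via DDIM inversion (eta = 0).
import torch

def ddim_step(x, t_from, t_to, eps, alphas):
    # Deterministic DDIM update between two timesteps.
    a_f, a_t = alphas[t_from], alphas[t_to]
    x0 = (x - (1 - a_f).sqrt() * eps) / a_f.sqrt()   # predicted clean latent
    return a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps

def invert_then_edit(x, eps_model, cond_src, cond_cf, alphas):
    timesteps = list(range(len(alphas)))
    # Abduction: run DDIM backwards (t = 0 -> T) with the source condition.
    z = x
    for t_from, t_to in zip(timesteps[:-1], timesteps[1:]):
        z = ddim_step(z, t_from, t_to, eps_model(z, t_from, cond_src), alphas)
    # Prediction: run DDIM forwards (T -> 0) with the counterfactual condition.
    for t_from, t_to in zip(timesteps[::-1][:-1], timesteps[::-1][1:]):
        z = ddim_step(z, t_from, t_to, eps_model(z, t_from, cond_cf), alphas)
    return z

# Toy check: with a zero epsilon-predictor the round trip is exactly reversible.
alphas = torch.linspace(0.99, 0.1, 10)
eps_model = lambda z, t, c: torch.zeros_like(z)
x = torch.randn(1, 4, 8, 8)
x_rec = invert_then_edit(x, eps_model, cond_src=None, cond_cf=None, alphas=alphas)
```

With `cond_cf == cond_src` the deterministic inversion reconstructs the input, which is the basis of the reversibility metric; attention guidance would additionally restrict the forward pass's edits to the intervened tokens' attention maps.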

Interactive Counterfactual Traversals

Drag the sliders to intervene on causal attributes and observe how counterfactual images change in response.

Pendulum
Attributes: P (pendulum angle), L (light position), SL (shadow length), SP (shadow position), X (image).
What if the pendulum angle had been different?
Pendulum traversal (slider: -30° to 30°)
What if the light position had been different?
Light traversal (slider: -1 to 1)

CelebA
Attributes: A (age), G (gender), Br (beard), Bl (bald), X (image).
What if the human age had been different?
CelebA age traversal (slider: young to old)
What if the human gender had been different?
Gender traversal (slider: male to female)

ADNI Brain MRI
Attributes: ApoE, Sx (sex), A (age), B (brain volume), S (slice), V (ventricle volume), X (image).
What if the brain age had been different?
ADNI age traversal (slider: 55 to 90)
What if the ventricle volume had been different?
Ventricle volume traversal (slider: 0 to 1)

State-of-the-Art Performance

Evaluated on 4 datasets: Pendulum (synthetic), CelebA, CelebA-HQ (faces), and ADNI (brain MRI).

Pendulum
91% MAE Reduction

Accurate continuous control over pendulum angle, light, shadow length, and shadow position with causal propagation.

CelebA
81% FID Reduction
86% LPIPS Reduction

Best realism and composition. F1 scores: 99.9% (gender), 58.5% (age), 52.1% (beard).

ADNI Brain MRI
87% FID Reduction
50% MAE Reduction

High-fidelity MRI generation with precise attribute control on age, brain volume, and ventricle volume.

CelebA-HQ
Best Reversibility
Best Identity Pres.

State-of-the-art on eyeglasses and smiling interventions with strong reversibility and identity preservation.

Metrics: Effectiveness (F1/MAE) | Realism (FID) | Composition (LPIPS/MAE) | Minimality (CLD)

BibTeX

@article{tong2025causal,
  title={Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation},
  author={Tong, Lei and Liu, Zhihua and Lu, Chaochao and Oglic, Dino and Diethe, Tom and Teare, Philip and Tsaftaris, Sotirios A and Jin, Chen},
  journal={arXiv preprint arXiv:2509.24798},
  year={2025}
}