Causal-Adapter: Taming Text-to-Image Diffusion
for Faithful Counterfactual Generation

A modular framework that adapts frozen T2I diffusion models for faithful, causally-grounded counterfactual image synthesis.

Lei Tong*1, Zhihua Liu*2, Chaochao Lu3, Dino Oglic1, Tom Diethe1, Philip Teare1, Sotirios A. Tsaftaris2, Chen Jin1
1 CAI, AstraZeneca, Cambridge    2 CHAI Hub, University of Edinburgh    3 Shanghai AI Lab
* Equal Contribution
ICML 2026
Counterfactual generation illustration

Counterfactual generation is specified by two standard inputs: a predefined causal graph and semantic attributes that define the intervention.
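These two inputs can be made concrete with a minimal sketch for the Pendulum dataset (the variable names and value ranges below are illustrative assumptions, not the paper's actual data format):

```python
# Hypothetical sketch: the two inputs to counterfactual generation.
# A causal graph, given as parent lists for each attribute...
causal_graph = {
    "pendulum_angle":  [],                                   # exogenous
    "light_position":  [],                                   # exogenous
    "shadow_length":   ["pendulum_angle", "light_position"],
    "shadow_position": ["pendulum_angle", "light_position"],
}

# ...and semantic attribute values for one image (normalized, illustrative).
attributes = {"pendulum_angle": 0.3, "light_position": -0.5,
              "shadow_length": 0.1, "shadow_position": -0.2}

# An intervention sets a target attribute; a faithful counterfactual must
# propagate its effects to the causal dependents (here, both shadows).
intervention = {"pendulum_angle": -0.8}
```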

Causal-Adapter instantiates a structural causal model (SCM) over the conditioning variables and tames an off-the-shelf text-to-image diffusion backbone to disentangle causal factors in the text-embedding space, generating faithful counterfactual images.

Key Highlights

"An off-the-shelf T2I diffusion model can be tamed with causal semantic attributes to generate faithful counterfactual images."

Significant Gains

Up to 50% better intervention effectiveness (MAE) and up to 87% lower FID than prior methods.

Low Compute

Fine-tunes in 10 hours on a single NVIDIA A10G (24 GB).

Model-Agnostic

Works with SD 1.5, SD 3, FLUX.1, and future T2I backbones.

Precise Attention

Better semantic-spatial alignment in diffusion latents via attention guidance.

Causal Graph Support

Supports learning causal structure (graph) from scratch when none is provided.

Open-Source

Code, data, and fine-tuned models will be publicly available.

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion models for counterfactual generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal mechanisms, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss that disentangles attribute factors and reduces spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.

Motivation & Comparison

Motivation study
Problem A

Continuous attributes are ignored

Current T2I models treat attributes as binary switches. Fine-grained, continuous control (e.g., ventricle volume) is lost in embedding space.

Problem B

Attribute entanglement

Editing one attribute (age) inadvertently changes others (beard, hairstyle) because T2I latents lack causal disentanglement.

Our Fix

Regularized cross-attention

Causal-Adapter's PAI + CTC loss disentangle token embeddings, yielding clean, localized attention maps per attribute.

Method comparison sketch
(a-c)

Prior generative approaches

VAE/GAN (a) produce low-fidelity outputs. Diffusion SCM (b) and autoencoders (c) disentangle only in auxiliary encoders, leaving diffusion latents entangled.

(d)

T2I-based editing

Heavy prompt engineering without explicit causal mechanisms. Attention maps misguide edits.

(e-f)

Causal-Adapter (Ours)

Injects causal attributes into learnable token embeddings with contrastive optimization, achieving both disentanglement and high fidelity.

Method

Causal-Adapter operates in three stages, plugging into a frozen T2I diffusion backbone.

Causal-Adapter framework overview
1

Causal Mechanism Modeling

Given a causal graph G and semantic attributes Y, each causal mechanism f_i is modeled as a nonlinear MLP with additive noise. The SCM propagates interventions to all downstream variables.
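A minimal sketch of one such mechanism and of intervention propagation, assuming one small MLP per structural equation (layer sizes, shapes, and the topological-order convention are our assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class CausalMechanism(nn.Module):
    """One structural equation y_i = f_i(parents(y_i)) + noise,
    with f_i a small nonlinear MLP (hidden width is illustrative)."""
    def __init__(self, num_parents: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(max(num_parents, 1), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, parents: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        return self.mlp(parents) + noise

def propagate(graph, mechanisms, noises, do=None):
    """Evaluate the SCM in topological order; `do` overrides intervened
    nodes so their effects flow to all downstream variables."""
    do = do or {}
    out = {}
    for node in graph:  # assumes keys are listed in topological order
        if node in do:
            out[node] = do[node]
        else:
            parents = graph[node]
            pa = (torch.stack([out[p] for p in parents], dim=-1)
                  if parents else torch.zeros(1, 1))
            out[node] = mechanisms[node](pa, noises[node]).squeeze(-1)
    return out

# Toy usage: intervene on the angle; the shadow mechanism sees the new value.
graph = {"angle": [], "light": [], "shadow_len": ["angle", "light"]}
mechanisms = {n: CausalMechanism(len(p)) for n, p in graph.items()}
noises = {n: torch.zeros(1, 1) for n in graph}
cf = propagate(graph, mechanisms, noises, do={"angle": torch.tensor([0.5])})
```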

2

Prompt-Aligned Injection (PAI)

Causal attributes are injected into learnable token embeddings that replace placeholder tokens in the prompt. This aligns causal semantics with spatial features in cross-attention.
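A hedged sketch of the injection step, assuming a scalar attribute value modulates a learnable token embedding that overwrites a placeholder position in the text-encoder output (the mapping network, the CLIP-like shapes, and the placeholder handling are illustrative, not the released code):

```python
import torch
import torch.nn as nn

class AttributeToken(nn.Module):
    """Learnable token embedding modulated by a scalar attribute value."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.base = nn.Parameter(torch.randn(embed_dim) * 0.02)  # learnable token
        self.scale = nn.Linear(1, embed_dim)  # maps attribute value into embedding space

    def forward(self, value: torch.Tensor) -> torch.Tensor:
        # value: (batch, 1) attribute value -> (batch, embed_dim) token embedding
        return self.base + self.scale(value)

def inject(prompt_embeds, placeholder_index, token_embed):
    """Replace the placeholder position in the prompt embeddings, so
    cross-attention sees the causal attribute at that token's slot."""
    out = prompt_embeds.clone()
    out[:, placeholder_index] = token_embed
    return out

# Toy usage with CLIP-like prompt embeddings of shape (batch, 77, 768).
prompt_embeds = torch.zeros(2, 77, 768)
token = AttributeToken()
age_embed = token(torch.tensor([[0.3], [0.7]]))
conditioned = inject(prompt_embeds, placeholder_index=5, token_embed=age_embed)
```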

3

Conditioned Token Contrastive Loss

An InfoNCE-based loss pulls same-attribute tokens together and pushes different-attribute tokens apart, enforcing disentanglement and reducing spurious correlations.
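A generic InfoNCE-style sketch of such a loss, treating tokens that share an attribute label as positives and all others as negatives (the paper's exact conditioning may differ):

```python
import torch
import torch.nn.functional as F

def conditioned_token_contrastive_loss(tokens, attr_ids, temperature=0.07):
    """InfoNCE-style loss over attribute token embeddings.

    tokens:   (N, D) attribute token embeddings from a batch
    attr_ids: (N,) integer attribute label per token
    Same-attribute pairs are pulled together; different-attribute
    pairs are pushed apart.
    """
    z = F.normalize(tokens, dim=-1)
    sim = z @ z.t() / temperature                 # (N, N) scaled cosine similarities
    n = tokens.shape[0]
    eye = torch.eye(n, dtype=torch.bool)
    pos = (attr_ids[:, None] == attr_ids[None, :]) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))  # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # negative mean log-probability over all positive pairs
    return -(log_prob[pos].sum() / pos.sum().clamp(min=1))

# Toy usage: six tokens carrying three attributes, two tokens each.
tokens = torch.randn(6, 8)
attr_ids = torch.tensor([0, 0, 1, 1, 2, 2])
loss = conditioned_token_contrastive_loss(tokens, attr_ids)
```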

At inference, abduction-action-prediction is performed via DDIM inversion. Optional Attention Guidance (AG) localizes edits to intervened tokens while preserving identity.
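The abduction-action-prediction loop can be sketched with toy DDIM inversion and sampling steps. The zero noise predictor below is a stand-in for the frozen T2I backbone, and the alpha-bar schedule is illustrative; with a real U-Net, the changed conditioning in the "action" step is what produces the counterfactual:

```python
import torch

alphas = torch.linspace(0.999, 0.01, 50)  # toy cumulative alpha-bar schedule

def eps(x, t, cond):
    # Stand-in noise predictor; a real model would depend on `cond`,
    # so intervened attributes would change the regenerated image.
    return torch.zeros_like(x)

def ddim_invert(x0, cond):
    """Abduction: deterministically map the image to its latent noise."""
    x = x0
    for t in range(len(alphas) - 1):
        a, a_next = alphas[t], alphas[t + 1]
        e = eps(x, t, cond)
        x0_pred = (x - (1 - a).sqrt() * e) / a.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * e
    return x

def ddim_sample(xT, cond):
    """Prediction: regenerate from the latent under the new conditioning."""
    x = xT
    for t in range(len(alphas) - 1, 0, -1):
        a, a_prev = alphas[t], alphas[t - 1]
        e = eps(x, t, cond)
        x0_pred = (x - (1 - a).sqrt() * e) / a.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * e
    return x

x0 = torch.randn(1, 4, 8, 8)                   # toy latent "image"
latent = ddim_invert(x0, cond={"age": 0.2})    # abduction
x_cf = ddim_sample(latent, cond={"age": 0.8})  # action + prediction
```

Because both passes are deterministic, the round trip reconstructs the input when the conditioning is unchanged, which is the identity-preservation property the paper measures as reversibility.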

Interactive Counterfactual Traversals

Drag the sliders to intervene on causal attributes and observe how counterfactual images change in response.

Pendulum
Attributes: P (Pendulum), L (Light), SL (Shadow Length), SP (Shadow Position), X (Image)
What if the pendulum angle had been different?
Pendulum traversal
Slider range: -30° to 30°
What if the light position had been different?
Light traversal
Slider range: -1 to 1
CelebA
Attributes: A (Age), G (Gender), Br (Beard), Bl (Bald), X (Image)
What if the human age had been different?
CelebA age traversal
Slider range: Young to Old
What if the human gender had been different?
Gender traversal
Slider range: Male to Female
ADNI Brain MRI
Attributes: ApoE, Sx (Sex), A (Age), B (Brain volume), S (Slice), V (Ventricle volume), X (Image)
What if the brain age had been different?
ADNI age traversal
Slider range: 55 to 90
What if the ventricle volume had been different?
Ventricle volume traversal
Slider range: 0 to 1

State-of-the-Art Performance

Evaluated on 4 datasets: Pendulum (synthetic), CelebA, CelebA-HQ (faces), and ADNI (brain MRI).

Pendulum
91% MAE Reduction

Accurate continuous control over pendulum angle, light, shadow length, and shadow position with causal propagation.

CelebA
81% FID Reduction
86% LPIPS Reduction

Best realism and composition. F1 scores: 99.9% (gender), 58.5% (age), 52.1% (beard).

ADNI Brain MRI
87% FID Reduction
50% MAE Reduction

High-fidelity MRI generation with precise attribute control on age, brain volume, and ventricle volume.

CelebA-HQ
Best Reversibility
Best Identity Pres.

State-of-the-art on eyeglasses and smiling interventions with strong reversibility and identity preservation.

Metrics: Effectiveness (F1/MAE) | Realism (FID) | Composition (LPIPS/MAE) | Minimality (CLD)

Related Projects

Below is a list of related projects on concept learning, prompt tuning, and their applications to novel content generation. You are welcome to check them out.

MCPL thumbnail
An Image is Worth Multiple Words: Discovering Object Level Concepts using Multi-Concept Prompt Learning (ICML 2024)

Multi-Concept Prompt Learning (MCPL) pioneers mask-free text-guided learning for multiple prompts from one scene. Our approach not only enhances current methodologies but also paves the way for novel applications, such as facilitating knowledge discovery through natural language-driven interactions between humans and machines.

Segment Anyword thumbnail
Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation (ICML 2025)

We leverage cross-attention maps from a diffusion inversion process to guide open-set grounded segmentation. This inversion helps mitigate the sensitivity to ambiguous text prompts. The resulting cross-attention based visual point prompts are further regularized using linguistic syntax and dependency information.

Lavender thumbnail
Lavender: Diffusion Instruction Tuning (ICML 2025)

Lavender (Language-and-Vision fine-tuning with Diffusion Aligner) is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion.

Causal-Adapter thumbnail
Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation (ICML 2026)

We present Causal-Adapter, a modular method that tames frozen text-to-image diffusion models for counterfactual image generation. The method enables causal interventions, consistently propagates their effects to dependent attributes, and preserves image identity.

BibTeX

@article{tong2025causal,
  title={Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation},
  author={Tong, Lei and Liu, Zhihua and Lu, Chaochao and Oglic, Dino and Diethe, Tom and Teare, Philip and Tsaftaris, Sotirios A and Jin, Chen},
  journal={arXiv preprint arXiv:2509.24798},
  year={2025}
}