Taming Text-to-Image Diffusion for Counterfactual Generation

1 CAI, AstraZeneca      2 CHAI Hub, University of Edinburgh      3 Shanghai AI Lab
Preprint

* Indicates Equal Contribution

Counterfactual generation is specified by two standard inputs: a predefined causal graph and semantic attributes that define the intervention. Causal-Adapter instantiates an SCM over the conditioning variables and tames an off-the-shelf text-to-image diffusion backbone to disentangle causal factors in the text-embedding space, generating faithful counterfactual images. It integrates seamlessly with Stable Diffusion 1.5/3, FLUX, and other backbones via lightweight adaptation.
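To make the SCM role concrete, here is a minimal sketch of the standard abduction-action-prediction recipe for counterfactuals on a small causal graph, in the spirit of the Pendulum example (angle and light position causing shadow attributes). The graph, coefficients, and function names are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical linear SCM sketch (NOT the paper's implementation):
# pendulum angle and light position jointly cause two shadow attributes.
def scm_forward(angle, light, noise=(0.0, 0.0)):
    # Illustrative structural equations with made-up coefficients.
    shadow_len = 0.5 * angle + 0.3 * light + noise[0]
    shadow_pos = 0.2 * angle + 0.7 * light + noise[1]
    return shadow_len, shadow_pos

def counterfactual(obs_angle, obs_light, obs_len, obs_pos, new_angle):
    # Abduction: recover the exogenous noise from the factual observation.
    base_len, base_pos = scm_forward(obs_angle, obs_light)
    u = (obs_len - base_len, obs_pos - base_pos)
    # Action + prediction: intervene on the angle, keep the noise fixed,
    # and propagate the causal effect to the downstream attributes.
    return scm_forward(new_angle, obs_light, u)

# "What if the pendulum angle had been -30 instead of 10?"
cf_len, cf_pos = counterfactual(10.0, 1.0, 5.5, 2.8, -30.0)
```

In Causal-Adapter the intervened and propagated attribute values then update the diffusion model's text conditioning, rather than being rendered by hand-written equations as in this toy sketch.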

Taming T2I Diffusion for Counterfactual Generation
“An off-the-shelf text-to-image diffusion model can be tamed with causal semantic attributes to generate faithful counterfactual images”

Causal-Adapter augments an SCM with prompt-aligned injection and a conditioned token contrastive loss to disentangle attributes, reduce spurious correlations, and achieve SOTA on synthetic and real-world datasets.

Key highlights:

  • Significant Gains: +50% intervention effectiveness and +87% image quality.
  • Low Compute: Finetunes in 10 hours on 1 NVIDIA A10G (24GB).
  • Model-Agnostic: Works with Stable Diffusion 1.5, FLUX.1, and more.
  • Precise Attention: Better semantic–spatial alignment in diffusion latents.
  • Causal Graph Support: can also learn the causal structure (graph) from scratch.
  • Open-Source: Code, data & finetuned models will be available.

Disentangling Semantic Attributes and Propagating Causal Effects

Causal-Adapter enables faithful counterfactual image generation by intervening on specific attributes and propagating causal effects through the causal graph.

Demo image (Pendulum)
Human: "What if the pendulum angle had been different?" — pendulum angle traversal from -30° to 30°
Human: "What if the light position had been different?" — light position traversal from -1 to 1
Demo image (CelebA)
Human: "What if the human age had been different?" — age traversal from young to old
Human: "What if the human gender had been different?" — gender traversal from male to female
Demo image (ADNI)
Human: "What if the age had been different?" — age traversal from 55 to 90
Human: "What if the ventricle volume had been different?" — ventricle volume traversal from 0 to 1

Key Insights and Figures

Causal-Adapter Fig
Figure 1. Motivational study: (a) Current T2I models often overlook continuous attributes, limiting fine-grained edits. (b) Existing T2I methods suffer from attribute entanglement. (c) Cross-attention maps: base vs. regularized Causal-Adapter.
Causal-Adapter Fig
Figure 2. A sketch comparison of counterfactual image generation methods based on: (a) VAE or GAN, (b) diffusion SCM, (c) diffusion autoencoder, (d) T2I-based editing, (e) vanilla Causal-Adapter, and (f) Causal-Adapter with attribute regularization.

Method

fig 3

We propose Causal-Adapter, a simple yet effective module that plugs into a pretrained text-to-image diffusion model to generate faithful counterfactual images. Given an input image and an intervention specified by a causal graph and semantic attributes, Causal-Adapter injects causal signals into the text conditioning (via prompt-aligned injection) and is trained with reconstruction and contrastive objectives to disentangle attributes and reduce spurious correlations. At inference, intervened attributes update the conditioning to produce counterfactuals, with optional attention guidance for localized edits while preserving non-intervened identity cues.
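The two conditioning ingredients named above can be illustrated with a small sketch: attribute values are projected and added onto their aligned text-token embeddings (prompt-aligned injection), and an InfoNCE-style contrastive loss over conditioned tokens encourages disentanglement. All shapes, the token-to-attribute alignment, and the exact loss form are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # assumed token embedding dimension

def inject(prompt_tokens, attr_values, attr_proj):
    """Prompt-aligned injection sketch: add each projected attribute
    value onto the text token it aligns with (alignment assumed here
    to be attribute i -> token i)."""
    tokens = prompt_tokens.copy()
    for idx, (val, w) in enumerate(zip(attr_values, attr_proj)):
        tokens[idx] += val * w
    return tokens

def token_contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss over conditioned tokens: pull the anchor
    toward its positive, push it away from negatives."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(anchor, positive)]
                      + [sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()             # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

prompt = rng.normal(size=(4, d))       # 4 text tokens
proj = rng.normal(size=(2, d))         # learned projections for 2 attributes
conditioned = inject(prompt, [0.7, -0.3], proj)
loss = token_contrastive_loss(conditioned[0], conditioned[0] + 0.01,
                              [rng.normal(size=d) for _ in range(3)])
```

In the real adapter the projections are trained jointly with the diffusion backbone's reconstruction objective, so the injected signals stay consistent with the SCM's causal graph.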

BibTeX

@article{tong2025causal,
  title={Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation},
  author={Tong, Lei and Liu, Zhihua and Lu, Chaochao and Oglic, Dino and Diethe, Tom and Teare, Philip and Tsaftaris, Sotirios A and Jin, Chen},
  journal={arXiv preprint arXiv:2509.24798},
  year={2025}
}