Causal-Adapter: Taming Text-to-Image Diffusion
for Faithful Counterfactual Generation

A modular framework that adapts frozen T2I diffusion models for faithful, causally-grounded counterfactual image synthesis.

Lei Tong*1, Zhihua Liu*2, Chaochao Lu3, Dino Oglic1, Tom Diethe1, Philip Teare1, Sotirios A. Tsaftaris2, Chen Jin1
1 CAI, AstraZeneca, Cambridge    2 CHAI Hub, University of Edinburgh    3 Shanghai AI Lab
* Equal Contribution
ICML 2026
Counterfactual generation illustration

Counterfactual generation is specified by two standard inputs: a predefined causal graph and semantic attributes that define the intervention.
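These two inputs can be made concrete with a minimal sketch for the Pendulum dataset (the variable names and value ranges below are illustrative assumptions, not the paper's actual data format):

```python
# Hypothetical sketch: the two inputs to counterfactual generation.
# A causal graph, given as parent lists for each attribute...
causal_graph = {
    "pendulum_angle":  [],                                   # exogenous
    "light_position":  [],                                   # exogenous
    "shadow_length":   ["pendulum_angle", "light_position"],
    "shadow_position": ["pendulum_angle", "light_position"],
}

# ...and semantic attribute values for one image (normalized, illustrative).
attributes = {"pendulum_angle": 0.3, "light_position": -0.5,
              "shadow_length": 0.1, "shadow_position": -0.2}

# An intervention sets a target attribute; a faithful counterfactual must
# propagate its effects to the causal dependents (here, both shadows).
intervention = {"pendulum_angle": -0.8}
```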

Causal-Adapter instantiates a structural causal model (SCM) over the conditioning variables and tames an off-the-shelf text-to-image diffusion backbone to disentangle causal factors in the text-embedding space, generating faithful counterfactual images.

Key Highlights

"An off-the-shelf T2I diffusion model can be tamed with causal semantic attributes to generate faithful counterfactual images."

Significant Gains

Up to 50% better intervention effectiveness (MAE) and up to 87% lower FID than prior methods.

Low Compute

Fine-tunes in 10 hours on a single NVIDIA A10G (24 GB).

Model-Agnostic

Works with SD 1.5, SD 3, FLUX.1, and future T2I backbones.

Precise Attention

Better semantic-spatial alignment in diffusion latents via attention guidance.

Causal Graph Support

Supports learning causal structure (graph) from scratch when none is provided.

Open-Source

Code, data, and fine-tuned models will be publicly available.

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion models for counterfactual generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal mechanisms, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss that disentangles attribute factors and reduces spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.

Motivation & Comparison

Motivation study
Problem A

Continuous attributes are ignored

Current T2I models treat attributes as binary switches. Fine-grained, continuous control (e.g., ventricle volume) is lost in embedding space.

Problem B

Attribute entanglement

Editing one attribute (age) inadvertently changes others (beard, hairstyle) because T2I latents lack causal disentanglement.

Our Fix

Regularized cross-attention

Causal-Adapter's PAI + CTC loss disentangle token embeddings, yielding clean, localized attention maps per attribute.

Method comparison sketch
(a-c)

Prior generative approaches

VAE/GAN (a) produce low-fidelity outputs. Diffusion SCM (b) and autoencoders (c) disentangle only in auxiliary encoders, leaving diffusion latents entangled.

(d)

T2I-based editing

Heavy prompt engineering without explicit causal mechanisms. Attention maps misguide edits.

(e-f)

Causal-Adapter (Ours)

Injects causal attributes into learnable token embeddings with contrastive optimization, achieving both disentanglement and high fidelity.

Method

Causal-Adapter operates in three stages, plugging into a frozen T2I diffusion backbone.

Causal-Adapter framework overview
1

Causal Mechanism Modeling

Given a causal graph G and semantic attributes Y, each causal mechanism f_i is modeled as a nonlinear MLP with additive noise. The SCM propagates interventions to all downstream variables.
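A minimal sketch of one such mechanism and of intervention propagation, assuming one small MLP per structural equation (layer sizes, shapes, and the topological-order convention are our assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class CausalMechanism(nn.Module):
    """One structural equation y_i = f_i(parents(y_i)) + noise,
    with f_i a small nonlinear MLP (hidden width is illustrative)."""
    def __init__(self, num_parents: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(max(num_parents, 1), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, parents: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        return self.mlp(parents) + noise

def propagate(graph, mechanisms, noises, do=None):
    """Evaluate the SCM in topological order; `do` overrides intervened
    nodes so their effects flow to all downstream variables."""
    do = do or {}
    out = {}
    for node in graph:  # assumes keys are listed in topological order
        if node in do:
            out[node] = do[node]
        else:
            parents = graph[node]
            pa = (torch.stack([out[p] for p in parents], dim=-1)
                  if parents else torch.zeros(1, 1))
            out[node] = mechanisms[node](pa, noises[node]).squeeze(-1)
    return out

# Toy usage: intervene on the angle; the shadow mechanism sees the new value.
graph = {"angle": [], "light": [], "shadow_len": ["angle", "light"]}
mechanisms = {n: CausalMechanism(len(p)) for n, p in graph.items()}
noises = {n: torch.zeros(1, 1) for n in graph}
cf = propagate(graph, mechanisms, noises, do={"angle": torch.tensor([0.5])})
```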

2

Prompt-Aligned Injection (PAI)

Causal attributes are injected into learnable token embeddings that replace placeholder tokens in the prompt. This aligns causal semantics with spatial features in cross-attention.
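A hedged sketch of the injection step, assuming a scalar attribute value modulates a learnable token embedding that overwrites a placeholder position in the text-encoder output (the mapping network, the CLIP-like shapes, and the placeholder handling are illustrative, not the released code):

```python
import torch
import torch.nn as nn

class AttributeToken(nn.Module):
    """Learnable token embedding modulated by a scalar attribute value."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.base = nn.Parameter(torch.randn(embed_dim) * 0.02)  # learnable token
        self.scale = nn.Linear(1, embed_dim)  # maps attribute value into embedding space

    def forward(self, value: torch.Tensor) -> torch.Tensor:
        # value: (batch, 1) attribute value -> (batch, embed_dim) token embedding
        return self.base + self.scale(value)

def inject(prompt_embeds, placeholder_index, token_embed):
    """Replace the placeholder position in the prompt embeddings, so
    cross-attention sees the causal attribute at that token's slot."""
    out = prompt_embeds.clone()
    out[:, placeholder_index] = token_embed
    return out

# Toy usage with CLIP-like prompt embeddings of shape (batch, 77, 768).
prompt_embeds = torch.zeros(2, 77, 768)
token = AttributeToken()
age_embed = token(torch.tensor([[0.3], [0.7]]))
conditioned = inject(prompt_embeds, placeholder_index=5, token_embed=age_embed)
```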

3

Conditioned Token Contrastive Loss

An InfoNCE-based loss pulls same-attribute tokens together and pushes different-attribute tokens apart, enforcing disentanglement and reducing spurious correlations.
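A generic InfoNCE-style sketch of such a loss, treating tokens that share an attribute label as positives and all others as negatives (the paper's exact conditioning may differ):

```python
import torch
import torch.nn.functional as F

def conditioned_token_contrastive_loss(tokens, attr_ids, temperature=0.07):
    """InfoNCE-style loss over attribute token embeddings.

    tokens:   (N, D) attribute token embeddings from a batch
    attr_ids: (N,) integer attribute label per token
    Same-attribute pairs are pulled together; different-attribute
    pairs are pushed apart.
    """
    z = F.normalize(tokens, dim=-1)
    sim = z @ z.t() / temperature                 # (N, N) scaled cosine similarities
    n = tokens.shape[0]
    eye = torch.eye(n, dtype=torch.bool)
    pos = (attr_ids[:, None] == attr_ids[None, :]) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))  # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # negative mean log-probability over all positive pairs
    return -(log_prob[pos].sum() / pos.sum().clamp(min=1))

# Toy usage: six tokens carrying three attributes, two tokens each.
tokens = torch.randn(6, 8)
attr_ids = torch.tensor([0, 0, 1, 1, 2, 2])
loss = conditioned_token_contrastive_loss(tokens, attr_ids)
```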

At inference, abduction-action-prediction is performed via DDIM inversion. Optional Attention Guidance (AG) localizes edits to intervened tokens while preserving identity.
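The abduction-action-prediction loop can be sketched with toy DDIM inversion and sampling steps. The zero noise predictor below is a stand-in for the frozen T2I backbone, and the alpha-bar schedule is illustrative; with a real U-Net, the changed conditioning in the "action" step is what produces the counterfactual:

```python
import torch

alphas = torch.linspace(0.999, 0.01, 50)  # toy cumulative alpha-bar schedule

def eps(x, t, cond):
    # Stand-in noise predictor; a real model would depend on `cond`,
    # so intervened attributes would change the regenerated image.
    return torch.zeros_like(x)

def ddim_invert(x0, cond):
    """Abduction: deterministically map the image to its latent noise."""
    x = x0
    for t in range(len(alphas) - 1):
        a, a_next = alphas[t], alphas[t + 1]
        e = eps(x, t, cond)
        x0_pred = (x - (1 - a).sqrt() * e) / a.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * e
    return x

def ddim_sample(xT, cond):
    """Prediction: regenerate from the latent under the new conditioning."""
    x = xT
    for t in range(len(alphas) - 1, 0, -1):
        a, a_prev = alphas[t], alphas[t - 1]
        e = eps(x, t, cond)
        x0_pred = (x - (1 - a).sqrt() * e) / a.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * e
    return x

x0 = torch.randn(1, 4, 8, 8)                   # toy latent "image"
latent = ddim_invert(x0, cond={"age": 0.2})    # abduction
x_cf = ddim_sample(latent, cond={"age": 0.8})  # action + prediction
```

Because both passes are deterministic, the round trip reconstructs the input when the conditioning is unchanged, which is the identity-preservation property the paper measures as reversibility.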

Interactive Counterfactual Traversals

Drag the sliders to intervene on causal attributes and observe how counterfactual images change in response.

Pendulum
Attributes: P (Pendulum), L (Light), SL (Shadow Length), SP (Shadow Position), X (Image)
What if the pendulum angle had been different?
Pendulum traversal
Slider range: -30° to 30°
What if the light position had been different?
Light traversal
Slider range: -1 to 1
CelebA
Attributes: A (Age), G (Gender), Br (Beard), Bl (Bald), X (Image)
What if the human age had been different?
CelebA age traversal
Slider range: Young to Old
What if the human gender had been different?
Gender traversal
Slider range: Male to Female
ADNI Brain MRI
Attributes: ApoE, Sx (Sex), A (Age), B (Brain volume), S (Slice), V (Ventricle volume), X (Image)
What if the brain age had been different?
ADNI age traversal
Slider range: 55 to 90
What if the ventricle volume had been different?
Ventricle volume traversal
Slider range: 0 to 1

State-of-the-Art Performance

Evaluated on 4 datasets: Pendulum (synthetic), CelebA, CelebA-HQ (faces), and ADNI (brain MRI).

Pendulum
91% MAE Reduction

Accurate continuous control over pendulum angle, light, shadow length, and shadow position with causal propagation.

CelebA
81% FID Reduction
86% LPIPS Reduction

Best realism and composition. F1 scores: 99.9% (gender), 58.5% (age), 52.1% (beard).

ADNI Brain MRI
87% FID Reduction
50% MAE Reduction

High-fidelity MRI generation with precise attribute control on age, brain volume, and ventricle volume.

CelebA-HQ
Best Reversibility
Best Identity Pres.

State-of-the-art on eyeglasses and smiling interventions with strong reversibility and identity preservation.

Metrics: Effectiveness (F1/MAE) | Realism (FID) | Composition (LPIPS/MAE) | Minimality (CLD)

Related Projects

Below is a list of related projects on concept learning, prompt tuning, and their applications to novel content generation. You are welcome to check them out.

MCPL thumbnail
An Image is Worth Multiple Words: Discovering Object Level Concepts using Multi-Concept Prompt Learning (ICML 2024)

Multi-Concept Prompt Learning (MCPL) pioneers mask-free text-guided learning for multiple prompts from one scene. Our approach not only enhances current methodologies but also paves the way for novel applications, such as facilitating knowledge discovery through natural language-driven interactions between humans and machines.

Segment Anyword thumbnail
Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation (ICML 2025)

We leverage cross-attention maps from a diffusion inversion process to guide open-set grounded segmentation. This inversion helps mitigate the sensitivity to ambiguous text prompts. The resulting cross-attention based visual point prompts are further regularized using linguistic syntax and dependency information.

Lavender thumbnail
Lavender: Diffusion Instruction Tuning (ICML 2025)

Lavender (Language-and-Vision fine-tuning with Diffusion Aligner) is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion.

Causal-Adapter thumbnail
Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation (ICML 2026)

We present Causal-Adapter, a modular method that tames frozen text-to-image diffusion models for counterfactual image generation. The method enables causal interventions, consistently propagates their effects to dependent attributes, and preserves image identity.

BibTeX

@article{tong2025causal,
  title={Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation},
  author={Tong, Lei and Liu, Zhihua and Lu, Chaochao and Oglic, Dino and Diethe, Tom and Teare, Philip and Tsaftaris, Sotirios A and Jin, Chen},
  journal={arXiv preprint arXiv:2509.24798},
  year={2025}
}