Story
Self-Conditioned Denoising for Atomistic Representation Learning
Key takeaway
Researchers developed a new machine learning technique to better understand materials at the atomic level. This could lead to improved designs for energy storage, electronics, and other applications that rely on advanced materials.
Quick Explainer
Self-Conditioned Denoising (SCD) is a self-supervised pretraining approach for learning robust representations of atomistic data, such as molecules, proteins, and materials. The key idea is to use a conditional embedding that represents the uncorrupted target geometry, allowing the model to learn meaningful global and roto-translation invariant representations. This addresses limitations of standard node denoising, which fails to capture global semantics and focuses primarily on equivariant vector embeddings. SCD pretraining can leverage any atomistic data, including non-equilibrium geometries, and outperforms prior self-supervised methods while matching or exceeding the performance of supervised force-energy pretraining.
Deep Dive
Self-Conditioned Denoising for Atomistic Representation Learning
Overview
The success of large-scale pretraining in NLP and computer vision has catalyzed growing efforts to develop analogous foundation models for the physical sciences. However, pretraining strategies using atomistic data remain underexplored. To address this, the authors present Self-Conditioned Denoising (SCD), a self-supervised pretraining objective that leverages self-embeddings for conditional denoising across diverse atomistic data, including small molecules, proteins, and periodic materials.
The key contributions are:
- SCD significantly outperforms prior self-supervised learning methods on downstream benchmarks, and matches or exceeds the performance of supervised force-energy pretraining when controlling for architecture and data.
- A small, fast GNN pretrained with SCD can achieve competitive or superior performance to larger models trained on significantly larger labeled or unlabeled datasets, across tasks in multiple domains.
- SCD benefits from pretraining on any atomistic data source, including non-equilibrium geometries, and does not require labeled force-energy data.
Methodology
Limitations of Node Denoising
The authors identify three key limitations of standard node denoising pretraining:
1. Locality: Node denoising fails to capture global semantic representations due to the locality of Gaussian noise perturbations.
2. Scalar vs. Vector Embeddings: Node denoising primarily impacts the equivariant vector ($L=1$) embeddings, without sufficient pressure on the roto-translation invariant scalar ($L=0$) embeddings needed for downstream property prediction.
3. Non-Equilibrium Ambiguity: Standard denoising is ineffective on "non-equilibrium" high-energy geometries, as the model cannot distinguish between corrupted inputs and valid target structures.
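The objective being critiqued can be made concrete with a minimal sketch. This is an illustrative toy, not the authors' implementation: in practice the denoiser is an equivariant GNN, and here it is left abstract. Note that the noise is drawn i.i.d. per atom, which is exactly why the training signal is local.

```python
import numpy as np

def corrupt(pos, sigma, rng):
    """Add i.i.d. Gaussian noise to each atom's coordinates independently.
    Because the noise is per-atom, the regression target carries no
    global information about the structure."""
    noise = rng.normal(0.0, sigma, pos.shape)
    return pos + noise, noise

def node_denoising_loss(pred_noise, true_noise):
    """Standard node-denoising target: regress the per-atom noise vectors."""
    return float(np.mean((pred_noise - true_noise) ** 2))

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))  # toy system: 5 atoms in 3D
noisy_pos, noise = corrupt(pos, sigma=0.1, rng=rng)
# A model would see `noisy_pos` and be trained to predict `noise`;
# a perfect denoiser would incur zero loss.
```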
Self-Conditioned Denoising (SCD)
To address these limitations, SCD introduces a conditional embedding that represents the uncorrupted target geometry. This conditioning:
- Provides the model with enough information to distinguish corrupted samples from valid target conformations (addressing limitation 3, non-equilibrium ambiguity).
- Encourages meaningful invariant ($L=0$) representations alongside the equivariant ($L=1$) embeddings (addressing limitation 2, scalar vs. vector embeddings).
- Pushes the model to develop better global, semantic representations (addressing limitation 1, locality).
The SCD objective can be implemented with any diffusion-style architecture. The authors use a modified version of the TorchMD-Net (ET) backbone, incorporating Adaptive Layer Normalization (AdaNorm) to enable conditional inputs.
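The conditioning mechanism can be sketched as a generic adaptive-LayerNorm block: the condition embedding (here, the self-embedding of the uncorrupted geometry) produces a learned scale and shift that modulate the normalized activations. This is a minimal numpy sketch of the general AdaLN pattern; how it is wired into the TorchMD-Net backbone is not specified here, and the weight shapes are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the feature (last) axis, no affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ada_norm(h, cond, W_scale, W_shift):
    """Adaptive LayerNorm: the condition embedding modulates the
    normalized node features via a learned scale and shift.
    With a zero condition this reduces to plain LayerNorm."""
    scale = cond @ W_scale  # (d_cond,) -> (d,)
    shift = cond @ W_shift
    return layer_norm(h) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))        # 4 nodes, 8 features
cond = rng.normal(size=6)          # self-embedding of the clean geometry
W_scale = rng.normal(size=(6, 8)) * 0.1
W_shift = rng.normal(size=(6, 8)) * 0.1
out = ada_norm(h, cond, W_scale, W_shift)
```

The design choice worth noting: because the condition enters through every normalization layer, the denoiser is pushed to encode the target geometry in the invariant self-embedding rather than recover it locally from the noisy input.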
Data & Experimental Setup
The authors evaluate SCD pretraining on a range of single-domain datasets, as well as a combined multi-domain dataset:
- Small Molecules: PCQM4Mv2 (PCQ), GEOM10
- Periodic Materials: Alex-MP-20 (AMP20)
- Protein-Ligand: Structurally Augmented IC50 Repository (SAIR)
- Combined: All of the above (AllAtoms or ALL)
They benchmark the pretrained models on three representative tasks:
- QM9 HOMO energy prediction for small molecules
- Matbench bandgap prediction for periodic materials
- Ligand Binding Affinity (LBA) prediction for protein-ligand complexes
Results
SCD Outperforms Other SSL Methods
Compared to standard node denoising, SCD pretraining yields 19.6-45.5% improvements across most QM9 targets.
SCD Benefits from Any Atomistic Data Source
SCD pretraining benefited significantly from every dataset tested, including non-equilibrium geometries. Notably, a 10% subset of the PCQ dataset (300k samples) yielded nearly the same downstream performance as the full 3.4M-sample dataset.
Smoother Latent Spaces Correlate with Better Performance
Models pretrained with SCD exhibit smoother, more compact latent spaces than untrained models or models pretrained with standard node denoising. This smoothness correlates with lower error on downstream property prediction tasks.
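One simple way to quantify this kind of smoothness is to measure how far the embedding moves under small coordinate perturbations. The probe below is an illustrative sketch under my own assumptions, not the authors' metric; the `encode` function stands in for a hypothetical pretrained encoder.

```python
import numpy as np

def smoothness(encode, pos, sigma=0.01, n=8, rng=None):
    """Average embedding displacement per unit of coordinate noise.
    Lower values indicate a smoother latent space around `pos`."""
    if rng is None:
        rng = np.random.default_rng(0)
    z0 = encode(pos)
    deltas = []
    for _ in range(n):
        z = encode(pos + rng.normal(0.0, sigma, pos.shape))
        deltas.append(np.linalg.norm(z - z0))
    return float(np.mean(deltas) / sigma)

pos = np.random.default_rng(1).normal(size=(6, 3))   # toy 6-atom geometry
encode = lambda p: p.mean(axis=0)  # toy stand-in encoder: mean-pooled coordinates
s = smoothness(encode, pos)
```

Comparing this statistic across checkpoints (untrained vs. node-denoised vs. SCD-pretrained) would be one way to reproduce the qualitative trend the authors report.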
SCD Matches or Exceeds Supervised Pretraining
When architecture and pretraining dataset are held fixed, SCD matches or outperforms supervised force-energy pretraining on the QM9, Matbench, and LBA tasks. A small, fast GNN pretrained with SCD achieves results competitive with or superior to larger models trained on significantly more data.
Limitations & Uncertainties
- The authors did not investigate scaling SCD to much larger models and datasets, which may uncover additional benefits.
- The underlying reasons for SCD's effectiveness on non-equilibrium and cross-domain data are not fully explained.
- It's unclear how SCD would perform on tasks that explicitly require accurate force predictions, rather than just property prediction.
What Comes Next
The authors suggest that SCD demonstrates the potential for self-supervised learning to enable scalable foundation model pretraining in the atomistic sciences, reducing the reliance on expensive DFT-labeled data. Future work could explore:
- Scaling SCD to much larger models and datasets
- Understanding the generalization capabilities of SCD-pretrained models
- Applying SCD to tasks that require accurate force prediction
