Story
Self-Conditioned Denoising for Atomistic Representation Learning
Key takeaway
Researchers developed a new machine learning technique to better understand materials at the atomic level. This could lead to improved designs for energy storage, electronics, and other applications that rely on advanced materials.
Quick Explainer
Self-Conditioned Denoising (SCD) is a self-supervised pretraining approach for learning robust representations of atomistic data, such as molecules, proteins, and materials. The key idea is to use a conditional embedding that represents the uncorrupted target geometry, allowing the model to learn meaningful global and roto-translation invariant representations. This addresses limitations of standard node denoising, which fails to capture global semantics and focuses primarily on equivariant vector embeddings. SCD pretraining can leverage any atomistic data, including non-equilibrium geometries, and outperforms prior self-supervised methods while matching or exceeding the performance of supervised force-energy pretraining.
Deep Dive
Self-Conditioned Denoising for Atomistic Representation Learning
Overview
The success of large-scale pretraining in NLP and computer vision has catalyzed growing efforts to develop analogous foundation models for the physical sciences. However, pretraining strategies using atomistic data remain underexplored. To address this, the authors present Self-Conditioned Denoising (SCD), a self-supervised pretraining objective that leverages self-embeddings for conditional denoising across diverse atomistic data, including small molecules, proteins, and periodic materials.
The key contributions are:
- SCD significantly outperforms prior self-supervised learning methods on downstream benchmarks, and matches or exceeds the performance of supervised force-energy pretraining when controlling for architecture and data.
- A small, fast GNN pretrained with SCD can achieve competitive or superior performance to larger models trained on significantly larger labeled or unlabeled datasets, across tasks in multiple domains.
- SCD benefits from pretraining on any atomistic data source, including non-equilibrium geometries, and does not require labeled force-energy data.
Methodology
Limitations of Node Denoising
The authors identify three key limitations of standard node denoising pretraining:
1. Locality: Node denoising fails to capture global semantic representations due to the locality of Gaussian noise perturbations.
2. Scalar vs. Vector Embeddings: Node denoising primarily impacts the equivariant vector ($L=1$) embeddings, without sufficient pressure on the roto-translation invariant scalar ($L=0$) embeddings needed for downstream property prediction.
3. Non-Equilibrium Ambiguity: Standard denoising is ineffective on "non-equilibrium" high-energy geometries, as the model cannot distinguish between corrupted inputs and valid target structures.
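The objective being critiqued can be made concrete with a minimal sketch. This is an illustrative toy, not the authors' implementation: in practice the denoiser is an equivariant GNN, and here it is left abstract. Note that the noise is drawn i.i.d. per atom, which is exactly why the training signal is local.

```python
import numpy as np

def corrupt(pos, sigma, rng):
    """Add i.i.d. Gaussian noise to each atom's coordinates independently.
    Because the noise is per-atom, the regression target carries no
    global information about the structure."""
    noise = rng.normal(0.0, sigma, pos.shape)
    return pos + noise, noise

def node_denoising_loss(pred_noise, true_noise):
    """Standard node-denoising target: regress the per-atom noise vectors."""
    return float(np.mean((pred_noise - true_noise) ** 2))

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))  # toy system: 5 atoms in 3D
noisy_pos, noise = corrupt(pos, sigma=0.1, rng=rng)
# A model would see `noisy_pos` and be trained to predict `noise`;
# a perfect denoiser would incur zero loss.
```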
Self-Conditioned Denoising (SCD)
To address these limitations, SCD introduces a conditional embedding that represents the uncorrupted target geometry. This conditioning:
- Provides the model with enough information to distinguish corrupted samples from valid target conformations (addressing limitation 3, non-equilibrium ambiguity).
- Encourages meaningful invariant ($L=0$) representations alongside the equivariant ($L=1$) embeddings (addressing limitation 2, scalar vs. vector embeddings).
- Pushes the model to develop better global, semantic representations (addressing limitation 1, locality).
The SCD objective can be implemented with any diffusion-style architecture. The authors use a modified version of the TorchMD-Net (ET) backbone, incorporating Adaptive Layer Normalization (AdaNorm) to enable conditional inputs.
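The conditioning mechanism can be sketched as a generic adaptive-LayerNorm block: the condition embedding (here, the self-embedding of the uncorrupted geometry) produces a learned scale and shift that modulate the normalized activations. This is a minimal numpy sketch of the general AdaLN pattern; how it is wired into the TorchMD-Net backbone is not specified here, and the weight shapes are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the feature (last) axis, no affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ada_norm(h, cond, W_scale, W_shift):
    """Adaptive LayerNorm: the condition embedding modulates the
    normalized node features via a learned scale and shift.
    With a zero condition this reduces to plain LayerNorm."""
    scale = cond @ W_scale  # (d_cond,) -> (d,)
    shift = cond @ W_shift
    return layer_norm(h) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))        # 4 nodes, 8 features
cond = rng.normal(size=6)          # self-embedding of the clean geometry
W_scale = rng.normal(size=(6, 8)) * 0.1
W_shift = rng.normal(size=(6, 8)) * 0.1
out = ada_norm(h, cond, W_scale, W_shift)
```

The design choice worth noting: because the condition enters through every normalization layer, the denoiser is pushed to encode the target geometry in the invariant self-embedding rather than recover it locally from the noisy input.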
Data & Experimental Setup
The authors evaluate SCD pretraining on a range of single-domain datasets, as well as a combined multi-domain dataset:
- Small Molecules: PCQM4Mv2 (PCQ), GEOM10
- Periodic Materials: Alex-MP-20 (AMP20)
- Protein-Ligand: Structurally Augmented IC50 Repository (SAIR)
- Combined: All of the above (AllAtoms or ALL)
They benchmark the pretrained models on three representative tasks:
- QM9 HOMO energy prediction for small molecules
- Matbench bandgap prediction for periodic materials
- Ligand Binding Affinity (LBA) prediction for protein-ligand complexes
Results
SCD Outperforms Other SSL Methods
Compared to standard node denoising, SCD pretraining yields 19.6-45.5% improvements across most QM9 targets.
SCD Benefits from Any Atomistic Data Source
SCD pretraining benefited significantly from every dataset tested, including non-equilibrium geometries. Notably, a 10% subset of the PCQ dataset (300k samples) yielded nearly the same downstream performance as the full 3.4M-sample dataset.
Smoother Latent Spaces Correlate with Better Performance
Models pretrained with SCD exhibit smoother, more compact latent spaces than untrained models or models pretrained with standard node denoising. This smoothness correlates with lower error on downstream property prediction tasks.
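One simple way to quantify this kind of smoothness is to measure how far the embedding moves under small coordinate perturbations. The probe below is an illustrative sketch under my own assumptions, not the authors' metric; the `encode` function stands in for a hypothetical pretrained encoder.

```python
import numpy as np

def smoothness(encode, pos, sigma=0.01, n=8, rng=None):
    """Average embedding displacement per unit of coordinate noise.
    Lower values indicate a smoother latent space around `pos`."""
    if rng is None:
        rng = np.random.default_rng(0)
    z0 = encode(pos)
    deltas = []
    for _ in range(n):
        z = encode(pos + rng.normal(0.0, sigma, pos.shape))
        deltas.append(np.linalg.norm(z - z0))
    return float(np.mean(deltas) / sigma)

pos = np.random.default_rng(1).normal(size=(6, 3))   # toy 6-atom geometry
encode = lambda p: p.mean(axis=0)  # toy stand-in encoder: mean-pooled coordinates
s = smoothness(encode, pos)
```

Comparing this statistic across checkpoints (untrained vs. node-denoised vs. SCD-pretrained) would be one way to reproduce the qualitative trend the authors report.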
SCD Matches or Exceeds Supervised Pretraining
When architecture and pretraining dataset are held fixed, SCD matches or outperforms supervised force-energy pretraining on the QM9, Matbench, and LBA tasks. A small, fast GNN pretrained with SCD achieves results competitive with or superior to larger models trained on significantly more data.
Limitations & Uncertainties
- The authors did not investigate scaling SCD to much larger models and datasets, which may uncover additional benefits.
- The underlying reasons for SCD's effectiveness on non-equilibrium and cross-domain data are not fully explained.
- It's unclear how SCD would perform on tasks that explicitly require accurate force predictions, rather than just property prediction.
What Comes Next
The authors suggest that SCD demonstrates the potential for self-supervised learning to enable scalable foundation model pretraining in the atomistic sciences, reducing the reliance on expensive DFT-labeled data. Future work could explore:
- Scaling SCD to much larger models and datasets
- Understanding the generalization capabilities of SCD-pretrained models
- Applying SCD to tasks that require accurate force prediction
