DiSCoVeR
A materials discovery algorithm geared towards exploring high-performance candidates in new chemical spaces using only composition data.

Bulk modulus values overlaid on DensMAP densities (cropped).
The documentation describes the Descending from Stochastic Clustering Variance Regression (DiSCoVeR) algorithm, how to install mat_discover, and basic usage (fit/predict, custom or built-in datasets, adaptive design, and cluster plots). Interactive plots for several types of Pareto front plots are available. We also describe how to contribute and what to do if you run into bugs or have questions. Various examples (including a teaching example), the interactive figures mentioned, and the Python API are also hosted at https://mat-discover.readthedocs.io. The open-access article is published at Digital Discovery. If you find this useful, please consider citing as follows:
Citing
Baird, S. G.; Diep, T. Q.; Sparks, T. D. DiSCoVeR: A Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions. Digital Discovery 2022. https://doi.org/10.1039/D1DD00028D.
@article{bairdDiSCoVeRMaterialsDiscovery2022,
title = {{{DiSCoVeR}}: A {{Materials Discovery Screening Tool}} for {{High Performance}}, {{Unique Chemical Compositions}}},
shorttitle = {{{DiSCoVeR}}},
author = {Baird, Sterling Gregory and Diep, Tran Q. and Sparks, Taylor D.},
year = {2022},
month = feb,
journal = {Digital Discovery},
publisher = {{RSC}},
issn = {2635-098X},
doi = {10.1039/D1DD00028D},
abstract = {We present Descending from Stochastic Clustering Variance Regression (DiSCoVeR) (https://github.com/sparks-baird/mat_discover), a Python tool for identifying and assessing high-performing, chemically unique compositions relative to existing compounds using a combination of a chemical distance metric, density-aware dimensionality reduction, clustering, and a regression model. In this work, we create pairwise distance matrices between compounds via Element Mover's Distance (ElMD) and use these to create 2D density-aware embeddings for chemical compositions via Density-preserving Uniform Manifold Approximation and Projection (DensMAP). Because ElMD assigns distances between compounds that are more chemically intuitive than Euclidean-based distances, the compounds can then be clustered into chemically homogeneous clusters via Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN*). In combination with performance predictions via Compositionally-Restricted Attention-Based Network (CrabNet), we introduce several new metrics for materials discovery and validate DiSCoVeR on Materials Project bulk moduli using compound-wise and cluster-wise validation methods. We visualize these via multi-objective Pareto front plots and assign a weighted score to each composition that encompasses the trade-off between performance and density-based chemical uniqueness. In addition to density-based metrics, we explore an additional uniqueness proxy related to property gradients in DensMAP space. As a validation study, we use DiSCoVeR to screen materials for both performance and uniqueness to extrapolate to new chemical spaces. Top-10 rankings are provided for the compound-wise density and property gradient uniqueness proxies. Top-ranked compounds can be further curated via literature searches, physics-based simulations, and/or experimental synthesis. Finally, we compare DiSCoVeR against the naive baseline of random search for several parameter combinations in an adaptive design scheme. To our knowledge, this is the first time automated screening has been performed with explicit emphasis on discovering high-performing, novel materials.},
langid = {english},
}
If you use this software, in addition to the above reference, please also cite the Zenodo DOI and state the version that you used:
Sterling Baird. (2022). sparks-baird/mat_discover. Zenodo. https://doi.org/10.5281/zenodo.5594678
@software{sterling_baird_2022_6116258,
author = {Sterling Baird},
title = {sparks-baird/mat\_discover},
month = feb,
year = 2022,
publisher = {Zenodo},
doi = {10.5281/zenodo.5594678},
url = {https://doi.org/10.5281/zenodo.5594678}
}
If you use this software as an installed dependency in another GitHub repository, please add mat_discover to a requirements.txt file in your repository, e.g. via:
pip install pipreqs
pipreqs .
pipreqs generates (at least a starting point for) a requirements.txt file based on the import statements in your working directory and subfolders. For an example, see requirements.txt.
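For illustration only, such a file simply lists your repository's third-party imports, one per line (pipreqs typically pins exact versions, omitted here because they depend on your environment):
crabnet
mat_discover
pandas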
DiSCoVeR Workflow
Why you’d want to use this tool, whether it’s “any good”, alternative tools, and summaries of the workflow.
Why DiSCoVeR?
The primary anticipated use case of DiSCoVeR is that you have some training data (chemical formulas and a target property), and you would like to determine the “next best experiment” to perform based on a user-defined relative importance of performance vs. chemical novelty. You can even run the model without any training targets, which is equivalent to setting the target weight to 0.
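As a minimal sketch of that novelty-only case (the pred_weight/proxy_weight keyword names are assumptions here; check the Python API documentation for the exact signature):
# Hedged sketch: novelty-only ranking by zeroing the performance (target) weight.
# Assumes train_df / val_df are loaded as shown in the Load Data section below.
from mat_discover.mat_discover_ import Discover

disc = Discover(pred_weight=0, proxy_weight=1)  # keyword names are assumptions
disc.fit(train_df)
scores = disc.predict(val_df)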
Is it any good?
Take an initial training set of 100 chemical formulas and associated Materials Project bulk moduli, followed by 900 adaptive design iterations (x-axis) using random search, novelty-only (performance weighted at 0), a 50/50 weighting split, and performance-only (novelty weighted at 0); these form the columns. The rows are the total number of observed “extraordinary” compounds (top 2%), the total number of additional unique atoms, and the total number of additional unique chemical formula templates. In other words:
How many “extraordinary” compounds have been observed so far?
How many unique atoms have been explored so far? (not counting atoms already in the starting 100 formulas)
How many unique chemical templates (e.g. A2B3, ABC, ABC2) have been explored so far? (not counting templates already in the starting 100 formulas)
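To make the “chemical template” notion concrete, a hedged illustration of computing a formula's anonymized template (using pymatgen, which is not a mat_discover requirement and is used here only for clarity):
# Hedged illustration of "chemical formula templates" via anonymized formulas.
from pymatgen.core import Composition

for formula in ["Al2O3", "Fe2O3", "SiO2", "BaTiO3"]:
    print(formula, "->", Composition(formula).anonymized_formula)
# Al2O3 and Fe2O3 share the A2B3 template; SiO2 is AB2; BaTiO3 is ABC3.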
The 50/50 weighting split offers a good trade-off between performance and novelty. Click the image to navigate to the interactive figure, which includes two additional rows: best so far and current observed.

We also ran some benchmarking against sklearn.neighbors.LocalOutlierFactor (a novelty detection algorithm) using mat2vec and mod_petti featurizations. The interactive results are available here.
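For reference, that baseline looks roughly like the following hedged sketch (the random features are placeholders; the benchmarks above used mat2vec and mod_petti composition featurizations):
# Hedged sketch of a LocalOutlierFactor novelty-detection baseline.
# X_train / X_val stand in for composition featurizations (e.g. mat2vec, mod_petti).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 8))  # placeholder features for training compositions
X_val = rng.normal(size=(20, 8))     # placeholder features for candidate compositions

lof = LocalOutlierFactor(n_neighbors=10, novelty=True)
lof.fit(X_train)
novelty_scores = -lof.score_samples(X_val)  # higher = more novel relative to the training set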
Alternatives
This approach is similar to what you will find with Bayesian optimization (BO), but with explicit emphasis on chemical novelty. If you’re interested in doing Bayesian optimization, I recommend using Facebook/Ax (not affiliated). I am working on an implementation of composition-based Bayesian optimization using Ax (2021-12-10).
For alternative “suggest next experiment” materials discovery tools, see the Citrine Platform (proprietary), ChemOS (proprietary), Olympus, CAMD (trihackathon2020 tutorial notebooks), PyChemia, Heteroscedastic-BO, and thermo.
For materials informatics (MI) and other relevant codebases/links, see:
my lists of MI codebases (~200 total), in particular:
composition-, crystal structure-, and molecule-based predictions
MI databases, especially NOMAD and MPDS
Other lists of MI-relevant codebases:
this curated list of “Awesome” materials informatics (~100 entries as of 2021-12-10)
Visualization
The DiSCoVeR workflow is visualized as follows:

Figure 1: DiSCoVeR workflow to create chemically homogeneous clusters. (a) Training and validation data are obtained in the form of chemical formulas and target properties (i.e. performance). (b) The training and validation chemical formulas are combined and used to compute ElMD pairwise distances. (c) ElMD pairwise distance matrices are used to compute DensMAP embeddings and DensMAP densities. (d) DensMAP embeddings are used to compute HDBSCAN* clusters. (e) Validation target property predictions are made via CrabNet and plotted against the uniqueness proxy (e.g. density proxy) in the form of a Pareto front plot. Discovery scores are assigned based on the (arbitrarily) weighted sum of scaled performance and uniqueness proxy. Higher scores are better. (f) HDBSCAN* clustering results can be used to obtain a cluster-wise performance (e.g. average target property) plotted against a cluster-wise uniqueness proxy (e.g. fraction of validation compounds vs. total compounds within a cluster).
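For readers who want to see the moving parts, a hedged sketch of steps (b)-(d) using the underlying libraries directly (ElMD, umap-learn, hdbscan) is below; mat_discover wraps and tunes these steps internally, so the parameters shown are illustrative assumptions only:
# Hedged sketch: ElMD distances -> DensMAP embedding/densities -> HDBSCAN* clusters.
import numpy as np
from ElMD import ElMD      # Element Mover's Distance
import umap                # DensMAP via umap-learn
import hdbscan             # HDBSCAN* clustering

formulas = ["Tc1V1", "Cu1Dy1", "Cd3N2", "Al2O3", "SiO2"]

# (b) pairwise ElMD distance matrix
n = len(formulas)
dm = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dm[i, j] = dm[j, i] = ElMD(formulas[i]).elmd(formulas[j])

# (c) density-aware 2D embedding (DensMAP) from the precomputed distances
embedder = umap.UMAP(densmap=True, output_dens=True, metric="precomputed", n_neighbors=3)
embedding, r_orig, r_emb = embedder.fit_transform(dm)
densities = np.exp(-r_emb)  # one illustrative way to turn local radii into a density proxy

# (d) chemically homogeneous clusters from the embedding
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(embedding)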
Tabular Summary
A summary of the DiSCoVeR methods are given in the following table:
Table 1: A description of methods used in this work and each method’s role in DiSCoVeR. ∗A Pareto front is more information-dense than a proxy score in that there are no predefined relative weights for performance vs. uniqueness proxy. Compounds that are closer to the Pareto front are better. The upper areas of the plot represent a higher weight towards performance while the right-most areas of the plot represent a higher weight towards uniqueness.
Method | What is it? | What is its role in DiSCoVeR? |
---|---|---|
CrabNet | Composition-based property regression | Predict performance for proxy scores |
ElMD | Composition-based distance metric | Supply distance matrix to DensMAP |
DensMAP | Density-aware dimensionality reduction | Obtain densities for density proxy |
HDBSCAN* | Density-aware clustering | Create chemically homogeneous clusters |
Peak proxy | High performance relative to nearby compounds | Proxy for “surprising” high performance |
Density proxy | Sparsity relative to nearby compounds | Proxy for chemical novelty |
Peak proxy score | Weighted sum of performance and peak proxy | Used to rank compounds |
Density proxy score | Weighted sum of performance and density proxy | Used to rank compounds |
Pareto front | Optimal performance/uniqueness trade-offs | Visually screen compounds (no weights*) |
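The proxy scores above reduce to a weighted sum of scaled quantities. A minimal sketch (the min-max scaling and weight names are illustrative, not the exact internals of mat_discover):
# Hedged sketch of a density proxy score: weighted sum of scaled predicted
# performance and a scaled density-based uniqueness proxy (higher is better).
import numpy as np

def scale(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())  # min-max scale to [0, 1]

pred = np.array([250.0, 180.0, 95.0])   # predicted performance (e.g. bulk modulus, GPa)
density = np.array([0.8, 0.2, 0.5])     # density proxy (lower density = more unique)

pred_weight, proxy_weight = 1.0, 1.0    # user-defined performance vs. uniqueness trade-off
dens_score = pred_weight * scale(pred) + proxy_weight * (1 - scale(density))
ranking = np.argsort(dens_score)[::-1]  # indices of the best candidates first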
Installation
There are three ways to install mat_discover: Anaconda (conda), PyPI (pip), and from source. Anaconda is the preferred method.
Anaconda
After installing Anaconda or Miniconda (Miniconda preferred), first update conda via:
conda update conda
Then add the following channels to your default channels list:
conda config --add channels conda-forge
conda config --add channels pytorch
I recommend that you run mat_discover in a separate conda environment, at least for initial testing. You can create a new environment in Python 3.9 (mat_discover is also tested on 3.7 and 3.8), install mat_discover, and activate it via:
conda create --name mat_discover --channel sgbaird python==3.9.* mat_discover
conda activate mat_discover
In English, this reads as “Create a new environment named mat_discover and install a version of Python that matches 3.9.* (e.g. 3.9.7) and the mat_discover package while looking preferentially in the @sgbaird Anaconda channel. Activate the mat_discover environment.”
Pip
Even if you use pip to install mat_discover, I still recommend doing so in a fresh conda environment, at least for initial testing:
conda create --name mat_discover python==3.9.*
conda activate mat_discover
To install via pip, first update pip via:
pip install -U pip
Due to limitations of PyPI distributions of CUDA/PyTorch, you will need to install PyTorch separately via the command that’s most relevant to you (PyTorch Getting Started). For example, for Stable/Windows/Pip/Python/CUDA-11.3:
pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
Finally, install mat_discover:
pip install mat_discover
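As an optional sanity check that the installation succeeded, try importing the package:
python -c "from mat_discover.mat_discover_ import Discover; print('mat_discover imports OK')"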
From Source
The same recommendation about using a fresh conda environment for initial testing applies here. To install from source, clone the mat_discover repository:
git clone https://github.com/sparks-baird/mat_discover.git
cd mat_discover
To perform the local installation, you can use pip, conda, or flit. If using flit, make sure to install it first via conda install flit or pip install flit. Typical commands for each tool are sketched below.
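As a rough sketch of what the local installation can look like (these are common patterns for each tool, not necessarily the exact commands documented for this repository):
# Hedged, common local-install patterns; check the repository docs for the exact commands.
pip install -e .               # pip: editable install from the cloned repo
flit install --pth-file       # flit: requires flit to be installed first
# conda: one option is to create and activate a conda environment first,
# then run one of the commands above inside it.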
Basic Usage
How to fit/predict, use custom or built-in datasets, and perform adaptive design.
Fit/Predict
from mat_discover.mat_discover_ import Discover
disc = Discover(target_unit="GPa")
disc.fit(train_df) # DataFrames should have at minimum ("formula" or "structure") and "target" columns
scores = disc.predict(val_df)
disc.plot()
disc.save()
print(disc.dens_score_df.head(10), disc.peak_score_df.head(10))
Note that target_unit="GPa" simply appends (GPa) to the end of plotting labels where appropriate.
See mat_discover_example.py, the Google Colab notebook, or the Binder notebook.
On Google Colab and Binder, this may take a few minutes to install and load,
respectively. During training and prediction, Google Colab will be faster than Binder
since Google Colab has access to a GPU while Binder does not. Sometimes Binder takes a long time to load, so please consider using Open In Colab or the normal installation instructions instead.
Load Data
From File
If you're using your own dataset, you will need to supply a Pandas DataFrame that contains formula (string) and target (numeric) columns (the target column is optional for val_df). If you have a train.csv file (located in the current working directory) with these two columns, it can be converted to a DataFrame via:
import pandas as pd
train_df = pd.read_csv("train.csv")
which might look something like the following:
formula | target |
---|---|
Tc1V1 | 248.539 |
Cu1Dy1 | 66.8444 |
Cd3N2 | 91.5034 |
For validation data without known property values to be used with predict, dummy values (all zeros) are assigned internally if the target column isn't present. In this case, you can read in a CSV file that contains only the formula (string) column:
val_df = pd.read_csv("val.csv")
formula |
---|
Al2O3 |
SiO2 |
Hard-coded
For a quick hard-coded example, you could use:
train_df = pd.DataFrame(dict(formula=["Tc1V1", "Cu1Dy1", "Cd3N2"], target=[248.539, 66.8444, 91.5034]))
val_df = pd.DataFrame(dict(formula=["Al2O3", "SiO2"]))
CrabNet Datasets (including Matbench)
NOTE: you can load any of the datasets within CrabNet/data/, which includes matbench data, other datasets from the CrabNet paper, and a recent (as of Oct 2021) snapshot of K_VRH bulk modulus data from Materials Project. For example, to load the bulk modulus snapshot:
from crabnet.data.materials_data import elasticity
train_df, val_df = disc.data(elasticity, "train.csv") # note that `val.csv` within `elasticity` is every other Materials Project compound (i.e. "target" column filled with zeros)
The built-in data directories are as follows:
{'benchmark_data', 'benchmark_data.CritExam__Ed', 'benchmark_data.CritExam__Ef', 'benchmark_data.OQMD_Bandgap', 'benchmark_data.OQMD_Energy_per_atom', 'benchmark_data.OQMD_Formation_Enthalpy', 'benchmark_data.OQMD_Volume_per_atom', 'benchmark_data.aflow__Egap', 'benchmark_data.aflow__ael_bulk_modulus_vrh', 'benchmark_data.aflow__ael_debye_temperature', 'benchmark_data.aflow__ael_shear_modulus_vrh', 'benchmark_data.aflow__agl_thermal_conductivity_300K', 'benchmark_data.aflow__agl_thermal_expansion_300K', 'benchmark_data.aflow__energy_atom', 'benchmark_data.mp_bulk_modulus', 'benchmark_data.mp_e_hull', 'benchmark_data.mp_elastic_anisotropy', 'benchmark_data.mp_mu_b', 'benchmark_data.mp_shear_modulus', 'element_properties', 'matbench', 'materials_data', 'materials_data.elasticity', 'materials_data.example_materials_property'}
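If you'd rather generate this list programmatically, one hedged approach (assuming crabnet is installed) is to walk the installed crabnet.data package:
# Hedged sketch: list the built-in data directories by walking crabnet.data subpackages.
from pkgutil import walk_packages
import crabnet.data

prefix = crabnet.data.__name__ + "."
data_dirs = {
    info.name[len(prefix):]
    for info in walk_packages(crabnet.data.__path__, prefix=prefix)
    if info.ispkg
}
print(sorted(data_dirs))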
To see what .csv files are available (e.g. train.csv), you will probably need to navigate to CrabNet/data/ and explore. For example, to use a snapshot of the Materials Project e_above_hull dataset (mp_e_hull):
from crabnet.data.benchmark_data import mp_e_hull
train_df = disc.data(mp_e_hull, "train.csv", split=False)
val_df = disc.data(mp_e_hull, "val.csv", split=False)
test_df = disc.data(mp_e_hull, "test.csv", split=False)
Directly via Materials Project
Finally, to download data from Materials Project directly, see generate_elasticity_data.py.
Adaptive Design
The anticipated end-use of mat_discover is in an adaptive design scheme where the objective function (e.g. wet-lab synthesis and characterization) is expensive. After loading some data for a validation scenario (or your own data):
from crabnet.data.materials_data import elasticity
from mat_discover.utils.data import data
from mat_discover.utils.extraordinary import extraordinary_split  # provides extraordinary_split used below
from mat_discover.adaptive_design import Adapt
train_df, val_df = data(elasticity, "train.csv", dummy=False, random_state=42)
train_df, val_df, extraordinary_thresh = extraordinary_split(
train_df, val_df, train_size=100, extraordinary_percentile=0.98, random_state=42
)
you can then predict your first additional experiment to run via:
adapt = Adapt(train_df, val_df, timed=False)
first_experiment = adapt.suggest_first_experiment() # fit Discover() to train_df, then move top-ranked from val_df to train_df
Subsequent experiments are suggested as follows:
second_experiment = adapt.suggest_next_experiment() # refit CrabNet, use existing DensMAP data, move top-ranked from val to train
third_experiment = adapt.suggest_next_experiment()
Alternatively, you can do this in a closed loop via:
n_iter = 100
adapt.closed_loop_adaptive_design(n_experiments=n_iter, print_experiment=False)
However, as the name suggests, the closed loop approach does not allow you to input data after each suggested experiment.
Cluster Plots
To quickly determine ElMD+DensMAP+HDBSCAN* cluster labels for your data, make an interactive cluster plot, and export a “paper-ready” PNG image, see the (nearly identical) example in elmd_densmap_cluster.ipynb.
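If you prefer to assemble a static plot yourself, a hedged matplotlib sketch (the notebook above produces a richer, interactive version) for a 2D embedding and cluster labels is:
# Hedged sketch: static "paper-ready" cluster plot from a 2D embedding and
# HDBSCAN* labels (e.g. produced as in the workflow sketch above).
import numpy as np
import matplotlib.pyplot as plt

def plot_clusters(embedding, labels, fname="cluster_plot.png"):
    """Scatter a 2D embedding colored by cluster label (-1 = unclustered) and save a PNG."""
    embedding, labels = np.asarray(embedding), np.asarray(labels)
    fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
    for label in np.unique(labels):
        mask = labels == label
        name = "unclustered" if label == -1 else f"cluster {label}"
        ax.scatter(embedding[mask, 0], embedding[mask, 1], s=10, label=name)
    ax.set_xlabel("DensMAP 1")
    ax.set_ylabel("DensMAP 2")
    ax.legend(fontsize=6)
    fig.savefig(fname, bbox_inches="tight")
    return fig, ax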
Bugs, Questions, and Suggestions
If you find a bug or have suggestions for documentation, please open an issue. If you're reporting a bug, please include a simplified reproducer. If you have questions, have feature suggestions/requests, or are interested in extending/improving mat_discover and would like to discuss, please use the Discussions tab and choose the appropriate category (“Ideas”, “Q&A”, etc.). If you have a question, please ask! I won’t bite. Pull requests are welcome and encouraged.
Looking for more?
See examples, including a teaching example, and the Python API.