mat_discover API
Module contents
Data-driven materials discovery based on composition or structure.
Submodules
mat_discover.mat_discover_ module
Materials discovery using Earth Mover’s Distance, DensMAP embeddings, and HDBSCAN*.
Create distance matrix, apply densMAP, and create clusters via HDBSCAN* to search for interesting materials. For example, materials with high-target/low-density (density proxy) or high-target surrounded by materials with low targets (peak proxy).
- class mat_discover.mat_discover_.Discover(timed: bool = True, dens_lambda: float = 1.0, plotting: bool = False, pdf: bool = True, n_peak_neighbors: int = 10, radius=None, verbose: bool = True, dummy_run: bool = False, Scaler=<class 'sklearn.preprocessing._data.RobustScaler'>, figure_dir: ~typing.Union[str, ~os.PathLike] = 'figures', table_dir: ~typing.Union[str, ~os.PathLike] = 'tables', target_unit: ~typing.Optional[str] = None, use_plotly_offline: bool = True, mapper=None, novelty_learner: str = 'discover', novelty_prop: str = 'mod_petti', pred_weight: float = 1.0, proxy_weight: float = 1.0, nscores: int = 100, regressor=None, use_structure: bool = False, umap_cluster_kwargs: ~typing.Optional[~typing.MutableMapping] = None, umap_vis_kwargs: ~typing.Optional[~typing.MutableMapping] = None, hdbscan_kwargs: ~typing.Optional[~typing.MutableMapping] = None)[source]
Bases: object
A Materials Discovery class.
Uses chemical-based distances, dimensionality reduction, clustering, and plotting to search for high performing, chemically unique compounds relative to training data.
- __init__(timed: bool = True, dens_lambda: float = 1.0, plotting: bool = False, pdf: bool = True, n_peak_neighbors: int = 10, radius=None, verbose: bool = True, dummy_run: bool = False, Scaler=<class 'sklearn.preprocessing._data.RobustScaler'>, figure_dir: ~typing.Union[str, ~os.PathLike] = 'figures', table_dir: ~typing.Union[str, ~os.PathLike] = 'tables', target_unit: ~typing.Optional[str] = None, use_plotly_offline: bool = True, mapper=None, novelty_learner: str = 'discover', novelty_prop: str = 'mod_petti', pred_weight: float = 1.0, proxy_weight: float = 1.0, nscores: int = 100, regressor=None, use_structure: bool = False, umap_cluster_kwargs: ~typing.Optional[~typing.MutableMapping] = None, umap_vis_kwargs: ~typing.Optional[~typing.MutableMapping] = None, hdbscan_kwargs: ~typing.Optional[~typing.MutableMapping] = None)[source]
Initialize a Discover() class.
- Parameters
timed (bool, optional) – Whether or not timing is reported, by default True
dens_lambda (float, optional) – “Controls the regularization weight of the density correlation term in densMAP. Higher values prioritize density preservation over the UMAP objective, and vice versa for values closer to zero. Setting this parameter to zero is equivalent to running the original UMAP algorithm.” Source: https://umap-learn.readthedocs.io/en/latest/api.html, by default 1.0
plotting (bool, optional) – Whether to create and save various compound-wise and cluster-wise figures, by default False
pdf (bool, optional) – Whether or not probability density function values are computed, by default True
n_peak_neighbors (int, optional) – Number of neighbors to consider when computing k_neigh_avg (i.e. peak proxy), by default 10
verbose (bool, optional) – Whether to print verbose information, by default True
dummy_run (bool, optional) – Whether to use MDS instead of UMAP to run quickly for small datasets. Note that MDS takes longer than UMAP for large datasets. By default False.
Scaler (str or class, optional) – Scaler to use for weighted_score (i.e. weighted score of target and proxy values). Target and proxy are separately scaled using Scaler before taking the weighted sum. Possible values are “MinMaxScaler”, “StandardScaler”, “RobustScaler”, or an sklearn.preprocessing scaler class, by default RobustScaler.
figure_dir (str, optional) – Relative or absolute path to the directory in which to save figures, by default “figures”. The directory will be created if it does not already exist. If dummy_run is True, “dummy” is appended to the folder via os.path.join.
table_dir (str, optional) – Relative or absolute path to the directory in which to save tables, by default “tables”. The directory will be created if it does not already exist. If dummy_run is True, “dummy” is appended to the folder via os.path.join.
target_unit (Optional[str]) – Unit of target to use in various x, y, and color axes labels. If None, don’t add a unit to the labels. By default None.
use_plotly_offline (bool) – Whether to use offline.plot(fig) instead of fig.show(). Set to False for Google Colab. By default, True.
pred_weight (int, optional) – Weighting applied to the predicted, scaled target values, by default 1 (i.e. equal weighting between predictions and proxies). For example, to weight the predicted targets at twice that of the proxy values, set to 2 (while keeping the default of proxy_weight = 1)
novelty_learner (str or sklearn Regressor, optional) – Whether to use the DiSCoVeR algorithm (“discover”) or another learner for novelty detection (e.g. sklearn.neighbors.LocalOutlierFactor). By default “discover”.
novelty_prop (str, optional) – Which featurization scheme to use for determining novelty. “mod_petti” is currently the only supported/tested option for the DiSCoVeR novelty_learner for speed considerations, though the other “linear” featurizers should technically be compatible (untested). The “vector” featurizers can be implemented, although with some code plumbing needed. See ElM2D [1]_ and ElMD supported featurizers [2]_. Possible options for sklearn-type novelty_learner-s are those supported by the CBFV [3]_ package (and assuming that all elements that appear in train/val datasets are supported). By default “mod_petti”.
proxy_weight (int, optional) – Weighting applied to the predicted, scaled proxy values, by default 1 (i.e. equal weighting between predictions and proxies when using default pred_weight = 1). For example, to weight the scaled proxy values at twice that of the predicted, scaled targets, set to 2 while retaining pred_weight = 1.
nscores (int, optional) – Number of scores (i.e. compounds) to return in the CSV output files.
regressor (instantiated class, optional) – The regressor to use for predicting target values, e.g., CrabNet(), or CrabNet(epochs=40) (may be useful to decrease the number of epochs for smaller datasets). See CrabNet() API. Can be another instantiated class, which at minimum contains fit(dataframe) and predict(dataframe) methods, where dataframe is a pandas DataFrame with at minimum columns (“formula” or “structure”) and “target”. If None, then defaults to CrabNet(). By default None.
use_structure (bool, optional) – Whether to use structure-based featurization instead of formula-based. If use_structure is False and regressor is None and mapper is None, then CrabNet and ElMD are used as the regressor and mapper, respectively. If use_structure is True and regressor is None and mapper is None, then M3GNet and GridRDF are used as the regressor and mapper, respectively. By default False.
umap_cluster_kwargs (dict, optional) – umap.UMAP kwargs that are passed directly into the UMAP embedder that is used for clustering. By default None. See basic UMAP parameters and the UMAP API. If this contains a dens_lambda key, the value in the Discover class kwarg will take precedence.
umap_vis_kwargs (dict, optional) – umap.UMAP kwargs that are passed directly into the UMAP embedder that is used for visualization. By default None. See basic UMAP parameters and the UMAP API. If this contains a dens_lambda key, the value in the Discover class kwarg will take precedence.
hdbscan_kwargs (dict, optional) – hdbscan.HDBSCAN kwargs that are passed directly into the HDBSCAN clusterer. By default None. See Parameter Selection for HDBSCAN* and the HDBSCAN API. If min_cluster_size is not specified, it defaults to 50. If min_samples is not specified, it defaults to 1. If cluster_selection_epsilon is not specified, it defaults to 0.63.
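The keyword-argument rules described for hdbscan_kwargs and the UMAP kwargs can be pictured with a small standalone sketch. The helper names resolve_hdbscan_kwargs and resolve_umap_kwargs are hypothetical and are not part of mat_discover; only the default values and the dens_lambda precedence rule come from the parameter descriptions above.

```python
# Hypothetical helpers (not part of mat_discover) that mimic the documented rules.

# Defaults stated in the hdbscan_kwargs description.
HDBSCAN_DEFAULTS = {
    "min_cluster_size": 50,
    "min_samples": 1,
    "cluster_selection_epsilon": 0.63,
}


def resolve_hdbscan_kwargs(hdbscan_kwargs=None):
    """Fill in the documented default for any key the caller left out."""
    resolved = dict(HDBSCAN_DEFAULTS)
    resolved.update(hdbscan_kwargs or {})
    return resolved


def resolve_umap_kwargs(umap_kwargs=None, dens_lambda=1.0):
    """Let the class-level dens_lambda override a dens_lambda key in the kwargs."""
    resolved = dict(umap_kwargs or {})
    resolved["dens_lambda"] = dens_lambda
    return resolved


print(resolve_hdbscan_kwargs({"min_samples": 5}))
print(resolve_umap_kwargs({"dens_lambda": 0.2, "n_neighbors": 30}, dens_lambda=1.0))
```

In this sketch a user-supplied min_samples replaces the default while the other two HDBSCAN defaults are kept, and the dens_lambda passed at the class level replaces the one inside the kwargs dictionary.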
References
- cluster(umap_emb, min_cluster_size=50, min_samples=1)[source]
Cluster using HDBSCAN*.
- Parameters
umap_emb (nD Array) – DensMAP embedding coordinates.
min_cluster_size (int, optional) – “The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.” (source: HDBSCAN* docs), by default 50
min_samples (int, optional) – “The number of samples in a neighbourhood for a point to be considered a core point.” (source: HDBSCAN* docs), by default 1
- Returns
clusterer – HDBSCAN clusterer fitted to UMAP embeddings.
- Return type
HDBSCAN class
- compute_log_density(r_orig=None)[source]
Compute the log density based on the radii.
- Parameters
r_orig (1d array, optional) – The original radii associated with the fitted DensMAP, by default None. If None, then defaults to self.std_r_orig.
- Returns
self.dens, self.log_dens – Densities and log densities associated with the original radii, respectively.
- Return type
1d array
Notes
Density is approximated as 1/r_orig
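As a minimal sketch of the note above, the density and log density can be computed from a list of radii as follows. The function name density_from_radii is hypothetical, and any extra scaling or clipping that the library may apply is not shown.

```python
import math


def density_from_radii(r_orig):
    """Approximate density as 1 / r_orig and return (dens, log_dens)."""
    dens = [1.0 / r for r in r_orig]  # assumes strictly positive radii
    log_dens = [math.log(d) for d in dens]
    return dens, log_dens


dens, log_dens = density_from_radii([0.5, 2.0])
print(dens)
print(log_dens)
```

Smaller radii give larger densities, so points in tightly packed regions of the original space receive higher log densities.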
- data(module, **data_kwargs)[source]
Grab data from within the subdirectories (modules) of mat_discover.
- Parameters
module (Module) – The module within mat_discover that contains e.g. “train.csv”. For example, from crabnet.data.materials_data import elasticity
fname (str, optional) – Filename of text file to open.
dummy (bool, optional) – Whether to pare down the data to a small test set, by default False
groupby (bool, optional) – Whether to use groupby_formula to filter identical compositions
split (bool, optional) – Whether to split the data into train, val, and (optionally) test sets, by default True
val_size (float, optional) – Validation dataset fraction, by default 0.2
test_size (float, optional) – Test dataset fraction, by default 0.0
random_state (int, optional) – seed to use for the train/val/test split, by default 42
- Returns
DataFrame – If split==False, then the full DataFrame is returned directly
DataFrame, DataFrame – If test_size == 0 and split==True, then training and validation DataFrames are returned.
DataFrame, DataFrame, DataFrame – If test_size > 0 and split==True, then training, validation, and test DataFrames are returned.
- dens_targ_scatter()[source]
Target value scatter plot (colored by target value) overlay on densities.
- extract_emb_rad(trans)[source]
Extract densMAP embedding and radii.
- Parameters
trans (class) – A fitted UMAP class.
- Returns
emb – UMAP embedding
r_orig – original radii
r_emb – embedded radii
See also
umap.UMAP
UMAP class.
- extract_labels_probs(clusterer)[source]
Extract cluster IDs (labels) and probabilities from HDBSCAN* clusterer.
- Parameters
clusterer (HDBSCAN class) – Instantiated HDBSCAN* class for clustering.
- Returns
labels_ (ndarray, shape (n_samples, )) – “Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.” (source: HDBSCAN* docs)
probabilities_ (ndarray, shape (n_samples, )) – “The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.” (source: HDBSCAN* docs)
- fit(train_df)[source]
Fit CrabNet model to training data.
- Parameters
train_df (DataFrame) – Should contain (“formula” or “structure”) and “target” columns.
- group_cross_val(df, umap_random_state=None, dummy_run=None)[source]
Perform leave-one-cluster-out cross-validation (LOCO-CV).
- Parameters
df (DataFrame) – Contains “formula” and “target” (all the data)
umap_random_state (int, optional) – Random state to use for DensMAP embedding, by default None
dummy_run (bool, optional) – Whether to perform a “dummy run” (i.e. use multi-dimensional scaling which is faster), by default None
- Returns
Scaled, weighted error based on Wasserstein distance (i.e. a sorting distance).
- Return type
float
- Raises
ValueError – Needs to have at least one cluster. It is assumed that there will always be a non-cluster (i.e. unclassified points) if there is only 1 cluster.
Notes
TODO: highest mean vs. highest single target value
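The splitting step behind leave-one-cluster-out cross-validation can be sketched without any clustering library. The generator below is a hypothetical illustration, not the mat_discover implementation; in particular, treating the noise label -1 as its own held-out group is an assumption made here.

```python
def leave_one_cluster_out(labels):
    """Yield (held_out_label, train_index, val_index) for each distinct label.

    Assumption: every distinct label, including the noise label -1,
    is held out once as the validation group.
    """
    for held_out in sorted(set(labels)):
        train_index = [i for i, lab in enumerate(labels) if lab != held_out]
        val_index = [i for i, lab in enumerate(labels) if lab == held_out]
        yield held_out, train_index, val_index


# Cluster labels as produced by a clusterer; -1 marks unclassified points.
labels = [0, 0, 1, -1, 1]
for held_out, train_index, val_index in leave_one_cluster_out(labels):
    print(held_out, train_index, val_index)
```

Each split trains on all other clusters and validates on the held-out one, which probes how well a model extrapolates to a chemically distinct group.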
- load(fpath='disc.pkl')[source]
Load Discover model.
- Parameters
fpath (str, optional) – Filepath from which to load, by default “disc.pkl”
- Returns
Loaded Discover() model.
- Return type
Class
- merge(nscores=100)[source]
Perform an outer merge of the density and peak proxy rankings.
- Returns
Outer merge of the two proxy rankings.
- Return type
DataFrame
- mvn_prob_sum(emb, r_orig, n=100)[source]
Gridded multivariate normal probability summation.
- Parameters
emb (ndarray) – Clustering embedding.
r_orig (1d array) – Original DensMAP radii.
n (int, optional) – Number of points along the x and y axes (total grid points = n^2), by default 100
- Returns
x (1d array) – x-coordinates
y (1d array) – y-coordinates
pdf_sum (1d array) – summed densities at the (x, y) locations
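One way to picture a gridded multivariate normal summation is sketched below in plain Python. It assumes that each embedded point contributes an isotropic 2D Gaussian whose standard deviation equals its radius; that choice, the pad parameter, and the function name gridded_gaussian_sum are assumptions for illustration and may differ from what mvn_prob_sum does internally.

```python
import math


def gridded_gaussian_sum(emb, radii, n=100, pad=3.0):
    """Sum isotropic 2D Gaussian densities on an n-by-n grid.

    emb   : list of (x, y) embedding coordinates
    radii : one positive radius per point, used as the standard deviation
    pad   : grid margin in units of the largest radius (illustrative choice)
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    margin = pad * max(radii)
    x_lo = min(p[0] for p in emb) - margin
    x_hi = max(p[0] for p in emb) + margin
    y_lo = min(p[1] for p in emb) - margin
    y_hi = max(p[1] for p in emb) + margin
    xs = [x_lo + (x_hi - x_lo) * i / (n - 1) for i in range(n)]
    ys = [y_lo + (y_hi - y_lo) * j / (n - 1) for j in range(n)]

    x, y, pdf_sum = [], [], []
    for gx in xs:
        for gy in ys:
            total = 0.0
            for (px, py), r in zip(emb, radii):
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-d2 / (2.0 * r * r)) / (2.0 * math.pi * r * r)
            x.append(gx)
            y.append(gy)
            pdf_sum.append(total)
    return x, y, pdf_sum  # flattened grid, n * n entries each


x, y, pdf_sum = gridded_gaussian_sum([(0.0, 0.0), (1.0, 1.0)], [0.5, 0.5], n=50)
print(len(pdf_sum), max(pdf_sum))
```

Because each Gaussian integrates to one, the summed density multiplied by the grid cell area should be close to the number of points when the grid covers the data with enough margin.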
- pf_frac_proxy()[source]
Cluster-wise average vs. cluster-wise validation fraction Pareto plot.
In other words, the average performance of a cluster vs. cluster novelty.
- pf_peak_proxy()[source]
Predicted target vs. peak proxy Pareto plot.
Peak proxy gives an idea of how “surprising” the performance is (i.e. a local peak in the ElMD space).
- pf_train_contrib_proxy()[source]
Predicted target vs. train contribution to validation log density Pareto plot.
This is only for the validation data. Training data contribution to validation log density is a proxy for chemical novelty (i.e. how novel is a given validation datapoint relative to the training data).
- plot(return_pareto_ind: bool = False)[source]
Plot and save various cluster and Pareto front figures.
- Parameters
return_pareto_ind (bool, optional) – Whether to return the pareto front indices, by default False
- Returns
pk_pareto_ind, dens_pareto_ind – Pareto front indices for the peak and density proxies, respectively.
- Return type
tuple of int
- predict(val_df, plotting: Optional[bool] = None, umap_random_state=None, pred_weight=None, proxy_weight=None, dummy_run: Optional[bool] = None, count_repeats: bool = False, return_peak: bool = False)[source]
Predict target and proxy for validation dataset.
- Parameters
val_df (DataFrame) – Validation dataset containing at minimum (“formula” or “structure”) and optionally “target” (targets are populated with 0’s if not available).
plotting (bool, optional) – Whether to plot, by default None
umap_random_state (int or None, optional) – The random seed to use for UMAP, by default None
pred_weight (int, optional) – The weight to assign to the scaled target predictions (proxy_weight = 1 by default), by default None. If neither pred_weight nor self.pred_weight is specified, it defaults to 1.
proxy_weight (int, optional) – The weight to assign to the scaled proxy predictions (pred_weight is 1 by default), by default None. When specified, proxy_weight takes precedence over self.proxy_weight. If neither proxy_weight nor self.proxy_weight is specified, it defaults to 1.
dummy_run (bool, optional) – Whether to use MDS in place of the (typically more expensive) DensMAP, by default None. If neither dummy_run nor self.dummy_run is specified, it defaults to (effectively) being False. When specified, dummy_run takes precedence over self.dummy_run.
count_repeats (bool, optional) – Whether repeat chemical formulae should intensify the local density (i.e. decrease the novelty) or not. By default False.
return_peak (bool, optional) – Whether or not to return the peak scores in addition to the density-based scores. By default, False.
- Returns
Scaled discovery scores for density and peak proxies. Returns only dens_score if return_peak is False, which is the default.
- Return type
dens_score, peak_score
- px_umap_cluster_scatter()[source]
Interactive scatter plot of DensMAP embeddings colored by clusters.
- save(fpath='disc.pkl', dummy=False)[source]
Save Discover model.
- Parameters
fpath (str, optional) – Filepath to which to save, by default “disc.pkl”
See also
load
load a Discover model.
- single_group_cross_val(X, y, train_index, val_index, iter)[source]
Perform leave-one-cluster-out cross-validation.
- Parameters
X (DataFrame of str or Structure objects) – Chemical formulae or pymatgen Structure objects.
y (Series of float) – Target properties.
train_index (1d array of int) – Training indices for a given split.
val_index (1d array of int) – Validation indices for a given split.
iter (int) – Iteration (i.e. how many clusters have been processed so far).
- Returns
true_avg_targ, pred_avg_targ, train_avg_targ – True, predicted, and training average targets for each of the clusters. The average target is used to create a “dummy” measure of performance (i.e. one of the simplest “models” you can use, the average of the training data).
- Return type
1d array of float
- sort(score, proxy_name='density')[source]
Sort (rank) compounds by their proxy score.
- Parameters
score (1D Array) – Discovery scores for the given proxy given by proxy_name.
proxy_name (string, optional) – Name of the proxy, by default “density”. Possible values are “density” (self.val_dens), “peak” (self.k_neigh_avg), and “radius” (self.val_rad_neigh_avg).
- Returns
Contains (“formula” or “structure”), “prediction”, proxy_name, and “score”.
- Return type
DataFrame
- umap_fit_cluster(dm, metric='precomputed', random_state=None)[source]
Perform DensMAP fitting for clustering.
See https://umap-learn.readthedocs.io/en/latest/clustering.html.
- Parameters
dm (ndarray) – Pairwise Element Mover’s Distance (ElMD) matrix within a single set of points.
metric (str) – Which metric to use for DensMAP, by default “precomputed”.
random_state (int, RandomState instance or None, optional (default: None)) – “If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.” (source: UMAP docs)
- Returns
umap_trans – A UMAP class fitted to dm.
- Return type
UMAP class
See also
umap.UMAP
UMAP class.
- umap_fit_vis(dm, random_state=None)[source]
Perform DensMAP fitting for visualization.
See https://umap-learn.readthedocs.io/en/latest/clustering.html.
- Parameters
dm (ndarray) – Pairwise Element Mover’s Distance (ElMD) matrix within a single set of points.
random_state (int, RandomState instance or None, optional (default: None)) – “If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.” (source: UMAP docs)
- Returns
std_trans – A UMAP class fitted to dm.
- Return type
UMAP class
See also
umap.UMAP
UMAP class.
- weighted_score(pred, proxy, pred_weight=None, proxy_weight=None, pred_scaler=None, proxy_scaler=None)[source]
Calculate weighted discovery score using the predicted target and proxy.
- Parameters
pred (1D Array) – Predicted target property values.
proxy (1D Array) – Predicted proxy values (e.g. density or peak proxies).
pred_weight (int, optional) – The weight to assign to the scaled predictions, by default 1
proxy_weight (int, optional) – The weight to assign to the scaled proxies, by default 1
- Returns
Discovery scores.
- Return type
1D array
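The idea of scaling the predictions and proxies separately and then combining them can be sketched as follows. Min-max scaling is used here as a dependency-free stand-in for the configurable Scaler (the default is RobustScaler), and the plain weighted sum without normalization is an assumption; the function names are hypothetical.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a stand-in for the configurable Scaler."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant input carries no ranking information
    return [(v - lo) / span for v in values]


def weighted_score_sketch(pred, proxy, pred_weight=1.0, proxy_weight=1.0):
    """Scale pred and proxy separately, then take their weighted sum."""
    pred_scaled = min_max_scale(pred)
    proxy_scaled = min_max_scale(proxy)
    return [
        pred_weight * p + proxy_weight * q
        for p, q in zip(pred_scaled, proxy_scaled)
    ]


scores = weighted_score_sketch([1.0, 2.0, 3.0], [10.0, 30.0, 20.0], pred_weight=2.0)
print(scores)
```

With pred_weight = 2 and proxy_weight = 1, a compound with the top prediction and a middling proxy value outranks one with a middling prediction and the top proxy value.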
- class mat_discover.mat_discover_.MEGNetWrapper(epochs=1000, r_cutoff=10, nfeat_bond=100)[source]
Bases: object
- mat_discover.mat_discover_.cdf_sorting_error(y_true, y_pred, y_dummy=None)[source]
Cumulative distribution function sorting error via Wasserstein distance.
- Parameters
y_true (list of (float or int or str)) – True values to use for the sorting error.
y_pred (list of (float or int or str)) – Predicted values to use for the sorting error.
y_dummy (list of (float or int or str), optional) – Dummy values to use to generate a scaled error, by default None
- Returns
error, dummy_error, scaled_error – The unscaled, dummy, and scaled errors that describe the mismatch in sorting between the CDFs of two lists. The scaled error represents the improvement relative to the dummy error, such that scaled_error = error / dummy_error. If scaled_error > 1, the sorting error is worse than if you took the average of the y_true values as the y_pred values. If scaled_error < 1, it is better than this “dummy” regressor. Scaled errors closer to 0 are better.
- Return type
float
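The relation scaled_error = error / dummy_error can be illustrated in isolation. The sketch below is not how cdf_sorting_error computes its errors: it uses a simple 1D Wasserstein distance between equal-size numeric samples as a stand-in, and it builds the dummy prediction as the mean of y_true. Both helper names are hypothetical.

```python
def w1_equal_size(u, v):
    """1D Wasserstein-1 distance between two equal-size, equally weighted samples."""
    if len(u) != len(v):
        raise ValueError("samples must have the same length")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def scaled_error_sketch(y_true, y_pred):
    """Return (error, dummy_error, scaled_error) using a mean-value dummy."""
    mean_true = sum(y_true) / len(y_true)
    y_dummy = [mean_true] * len(y_true)
    error = w1_equal_size(y_true, y_pred)
    dummy_error = w1_equal_size(y_true, y_dummy)  # zero if y_true is constant
    return error, dummy_error, error / dummy_error


error, dummy_error, scaled_error = scaled_error_sketch(
    [1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 2.5, 3.5]
)
print(error, dummy_error, scaled_error)
```

A scaled error below 1 in this sketch means the predictions match the true distribution more closely than the constant mean-value baseline does.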
mat_discover.adaptive_design_ module
- class mat_discover.adaptive_design.Adapt(train_df, val_df, **Discover_kwargs)[source]
Bases: Discover
- __init__(train_df, val_df, **Discover_kwargs)[source]
Initialize a Discover() class.
- Parameters
timed (bool, optional) – Whether or not timing is reported, by default True
dens_lambda (float, optional) – “Controls the regularization weight of the density correlation term in densMAP. Higher values prioritize density preservation over the UMAP objective, and vice versa for values closer to zero. Setting this parameter to zero is equivalent to running the original UMAP algorithm.” Source: https://umap-learn.readthedocs.io/en/latest/api.html, by default 1.0
plotting (bool, optional) – Whether to create and save various compound-wise and cluster-wise figures, by default False
pdf (bool, optional) – Whether or not probability density function values are computed, by default True
n_peak_neighbors (int, optional) – Number of neighbors to consider when computing k_neigh_avg (i.e. peak proxy), by default 10
verbose (bool, optional) – Whether to print verbose information, by default True
dummy_run (bool, optional) – Whether to use MDS instead of UMAP to run quickly for small datasets. Note that MDS takes longer than UMAP for large datasets. By default False.
Scaler (str or class, optional) – Scaler to use for weighted_score (i.e. weighted score of target and proxy values). Target and proxy are separately scaled using Scaler before taking the weighted sum. Possible values are “MinMaxScaler”, “StandardScaler”, “RobustScaler”, or an sklearn.preprocessing scaler class, by default RobustScaler.
figure_dir (str, optional) – Relative or absolute path to the directory in which to save figures, by default “figures”. The directory will be created if it does not already exist. If dummy_run is True, “dummy” is appended to the folder via os.path.join.
table_dir (str, optional) – Relative or absolute path to the directory in which to save tables, by default “tables”. The directory will be created if it does not already exist. If dummy_run is True, “dummy” is appended to the folder via os.path.join.
target_unit (Optional[str]) – Unit of target to use in various x, y, and color axes labels. If None, don’t add a unit to the labels. By default None.
use_plotly_offline (bool) – Whether to use offline.plot(fig) instead of fig.show(). Set to False for Google Colab. By default, True.
pred_weight (int, optional) – Weighting applied to the predicted, scaled target values, by default 1 (i.e. equal weighting between predictions and proxies). For example, to weight the predicted targets at twice that of the proxy values, set to 2 (while keeping the default of proxy_weight = 1)
novelty_learner (str or sklearn Regressor, optional) – Whether to use the DiSCoVeR algorithm (“discover”) or another learner for novelty detection (e.g. sklearn.neighbors.LocalOutlierFactor). By default “discover”.
novelty_prop (str, optional) – Which featurization scheme to use for determining novelty. “mod_petti” is currently the only supported/tested option for the DiSCoVeR novelty_learner for speed considerations, though the other “linear” featurizers should technically be compatible (untested). The “vector” featurizers can be implemented, although with some code plumbing needed. See ElM2D [1]_ and ElMD supported featurizers [2]_. Possible options for sklearn-type novelty_learner-s are those supported by the CBFV [3]_ package (and assuming that all elements that appear in train/val datasets are supported). By default “mod_petti”.
proxy_weight (int, optional) – Weighting applied to the predicted, scaled proxy values, by default 1 (i.e. equal weighting between predictions and proxies when using default pred_weight = 1). For example, to weight the scaled proxy values at twice that of the predicted, scaled targets, set to 2 while retaining pred_weight = 1.
nscores (int, optional) – Number of scores (i.e. compounds) to return in the CSV output files.
regressor (instantiated class, optional) – The regressor to use for predicting target values, e.g., CrabNet(), or CrabNet(epochs=40) (may be useful to decrease the number of epochs for smaller datasets). See CrabNet() API. Can be another instantiated class, which at minimum contains fit(dataframe) and predict(dataframe) methods, where dataframe is a pandas DataFrame with at minimum columns (“formula” or “structure”) and “target”. If None, then defaults to CrabNet(). By default None.
use_structure (bool, optional) – Whether to use structure-based featurization instead of formula-based. If use_structure is False and regressor is None and mapper is None, then CrabNet and ElMD are used as the regressor and mapper, respectively. If use_structure is True and regressor is None and mapper is None, then M3GNet and GridRDF are used as the regressor and mapper, respectively. By default False.
umap_cluster_kwargs (dict, optional) – umap.UMAP kwargs that are passed directly into the UMAP embedder that is used for clustering. By default None. See basic UMAP parameters and the UMAP API. If this contains a dens_lambda key, the value in the Discover class kwarg will take precedence.
umap_vis_kwargs (dict, optional) – umap.UMAP kwargs that are passed directly into the UMAP embedder that is used for visualization. By default None. See basic UMAP parameters and the UMAP API. If this contains a dens_lambda key, the value in the Discover class kwarg will take precedence.
hdbscan_kwargs (dict, optional) – hdbscan.HDBSCAN kwargs that are passed directly into the HDBSCAN clusterer. By default None. See Parameter Selection for HDBSCAN* and the HDBSCAN API. If min_cluster_size is not specified, it defaults to 50. If min_samples is not specified, it defaults to 1. If cluster_selection_epsilon is not specified, it defaults to 0.63.
References
- closed_loop_adaptive_design(n_experiments=900, extraordinary_thresh=None, extraordinary_quantile=0.98, **suggest_next_experiment_kwargs)[source]