mat_discover API

Module contents

Data-driven materials discovery based on composition or structure.

Submodules

mat_discover.mat_discover_ module

Materials discovery using Earth Mover’s Distance, DensMAP embeddings, and HDBSCAN*.

Create distance matrix, apply densMAP, and create clusters via HDBSCAN* to search for interesting materials. For example, materials with high-target/low-density (density proxy) or high-target surrounded by materials with low targets (peak proxy).

class mat_discover.mat_discover_.CDVAECovStructFingerprintWrapper[source]

Bases: object

__init__()[source]
fit(structures)[source]
class mat_discover.mat_discover_.CrabNetPretendCrystalWrapper(epochs=300)[source]

Bases: object

__init__(epochs=300)[source]
fit(train_df)[source]
predict(val_df)[source]
class mat_discover.mat_discover_.Discover(timed: bool = True, dens_lambda: float = 1.0, plotting: bool = False, pdf: bool = True, n_peak_neighbors: int = 10, radius=None, verbose: bool = True, dummy_run: bool = False, Scaler=<class 'sklearn.preprocessing._data.RobustScaler'>, figure_dir: ~typing.Union[str, ~os.PathLike] = 'figures', table_dir: ~typing.Union[str, ~os.PathLike] = 'tables', target_unit: ~typing.Optional[str] = None, use_plotly_offline: bool = True, mapper=None, novelty_learner: str = 'discover', novelty_prop: str = 'mod_petti', pred_weight: float = 1.0, proxy_weight: float = 1.0, nscores: int = 100, regressor=None, use_structure: bool = False, umap_cluster_kwargs: ~typing.Optional[~typing.MutableMapping] = None, umap_vis_kwargs: ~typing.Optional[~typing.MutableMapping] = None, hdbscan_kwargs: ~typing.Optional[~typing.MutableMapping] = None)[source]

Bases: object

A Materials Discovery class.

Uses chemical-based distances, dimensionality reduction, clustering, and plotting to search for high performing, chemically unique compounds relative to training data.

__init__(timed: bool = True, dens_lambda: float = 1.0, plotting: bool = False, pdf: bool = True, n_peak_neighbors: int = 10, radius=None, verbose: bool = True, dummy_run: bool = False, Scaler=<class 'sklearn.preprocessing._data.RobustScaler'>, figure_dir: ~typing.Union[str, ~os.PathLike] = 'figures', table_dir: ~typing.Union[str, ~os.PathLike] = 'tables', target_unit: ~typing.Optional[str] = None, use_plotly_offline: bool = True, mapper=None, novelty_learner: str = 'discover', novelty_prop: str = 'mod_petti', pred_weight: float = 1.0, proxy_weight: float = 1.0, nscores: int = 100, regressor=None, use_structure: bool = False, umap_cluster_kwargs: ~typing.Optional[~typing.MutableMapping] = None, umap_vis_kwargs: ~typing.Optional[~typing.MutableMapping] = None, hdbscan_kwargs: ~typing.Optional[~typing.MutableMapping] = None)[source]

Initialize a Discover() class.

Parameters
  • timed (bool, optional) – Whether or not timing is reported, by default True

  • dens_lambda (float, optional) – “Controls the regularization weight of the density correlation term in densMAP. Higher values prioritize density preservation over the UMAP objective, and vice versa for values closer to zero. Setting this parameter to zero is equivalent to running the original UMAP algorithm.” Source: https://umap-learn.readthedocs.io/en/latest/api.html, by default 1.0

  • plotting (bool, optional) – Whether to create and save various compound-wise and cluster-wise figures, by default False

  • pdf (bool, optional) – Whether or not probability density function values are computed, by default True

  • n_peak_neighbors (int, optional) – Number of neighbors to consider when computing k_neigh_avg (i.e. peak proxy), by default 10

  • verbose (bool, optional) – Whether to print verbose information, by default True

  • dummy_run (bool, optional) – Whether to use MDS instead of UMAP to run quickly for small datasets. Note that MDS takes longer than UMAP for large datasets, by default False

  • Scaler (str or class, optional) – Scaler to use for weighted_score (i.e. weighted score of target and proxy values). Target and proxy are separately scaled using Scaler before taking the weighted sum. Possible values are “MinMaxScaler”, “StandardScaler”, “RobustScaler”, or an sklearn.preprocessing scaler class, by default RobustScaler.

  • figure_dir (str, optional) – Relative or absolute path to the directory in which to save figures, by default “figures”. The directory will be created if it does not already exist. If dummy_run is True, “dummy” is appended to the path via os.path.join.

  • table_dir (str, optional) – Relative or absolute path to the directory in which to save tables, by default “tables”. The directory will be created if it does not already exist. If dummy_run is True, “dummy” is appended to the path via os.path.join.

  • target_unit (Optional[str]) – Unit of target to use in various x, y, and color axes labels. If None, don’t add a unit to the labels. By default None.

  • use_plotly_offline (bool) – Whether to use offline.plot(fig) instead of fig.show(). Set to False for Google Colab. By default, True.

  • pred_weight (float, optional) – Weighting applied to the predicted, scaled target values, by default 1 (i.e. equal weighting between predictions and proxies). For example, to weight the predicted targets at twice that of the proxy values, set to 2 (while keeping the default of proxy_weight = 1).

  • novelty_learner (str or sklearn Regressor, optional) – Whether to use the DiSCoVeR algorithm (“discover”) or another learner for novelty detection (e.g. sklearn.neighbors.LocalOutlierFactor). By default “discover”.

  • novelty_prop (str, optional) – Which featurization scheme to use for determining novelty. “mod_petti” is currently the only supported/tested option for the DiSCoVeR novelty_learner for speed considerations, though the other “linear” featurizers should technically be compatible (untested). The “vector” featurizers can be implemented, although with some code plumbing needed. See ElM2D [1] and ElMD supported featurizers [2]. Possible options for sklearn-type novelty learners are those supported by the CBFV [3] package (assuming that all elements that appear in the train/val datasets are supported). By default “mod_petti”.

  • proxy_weight (float, optional) – Weighting applied to the predicted, scaled proxy values, by default 1 (i.e. equal weighting between predictions and proxies when using default pred_weight = 1). For example, to weight the proxy values at twice that of the predicted, scaled targets, set to 2 while retaining pred_weight = 1.

  • nscores (int, optional) – Number of scores (i.e. compounds) to return in the CSV output files.

  • regressor (instantiated class, optional) – The regressor to use for predicting target values, e.g., CrabNet(), or CrabNet(epochs=40) (decreasing the number of epochs may be useful for smaller datasets). See CrabNet() API. Can be another instantiated class, which at minimum contains fit(dataframe) and predict(dataframe) methods, where dataframe is a pandas DataFrame with at minimum columns (“formula” or “structure”) and “target”. If None, then defaults to CrabNet(). By default None.

  • use_structure (bool, optional) – Whether to use structure-based featurization instead of formula-based. If use_structure is False and regressor is None and mapper is None, then CrabNet and ElMD are used as the regressor and mapper, respectively. If use_structure is True and regressor is None and mapper is None, then M3GNet and GridRDF are used as the regressor and mapper, respectively. By default False.

  • umap_cluster_kwargs (dict, optional) – umap.UMAP kwargs that are passed directly into the UMAP embedder that is used for clustering. By default None. See basic UMAP parameters and the UMAP API. If this contains a dens_lambda key, the value of the Discover class kwarg will take precedence.

  • umap_vis_kwargs (dict, optional) –

    umap.UMAP kwargs that are passed directly into the UMAP embedder that is used for visualization. By default None. See basic UMAP parameters and the UMAP API. If this contains a dens_lambda key, the value of the Discover class kwarg will take precedence.

  • hdbscan_kwargs (dict, optional) – hdbscan.HDBSCAN kwargs that are passed directly into the HDBSCAN clusterer. By default, None. See Parameter Selection for HDBSCAN* and the HDBSCAN API. If min_cluster_size is not specified, defaults to 50. If min_samples is not specified, defaults to 1. If cluster_selection_epsilon is not specified, defaults to 0.63.
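
The precedence rules described for the keyword dictionaries above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: the helper names resolve_hdbscan_kwargs and resolve_umap_kwargs are hypothetical, and only the default values and the dens_lambda rule are taken from the parameter descriptions.

```python
# Hypothetical helpers illustrating the documented precedence rules.
# They are not part of mat_discover.

HDBSCAN_DEFAULTS = {
    "min_cluster_size": 50,
    "min_samples": 1,
    "cluster_selection_epsilon": 0.63,
}


def resolve_hdbscan_kwargs(hdbscan_kwargs=None):
    """Fill in the documented defaults for any key the caller left out."""
    return {**HDBSCAN_DEFAULTS, **(hdbscan_kwargs or {})}


def resolve_umap_kwargs(umap_kwargs=None, dens_lambda=1.0):
    """Let the Discover-level dens_lambda override any value in the dict."""
    resolved = dict(umap_kwargs or {})
    resolved["dens_lambda"] = dens_lambda
    return resolved


print(resolve_hdbscan_kwargs({"min_samples": 5}))
print(resolve_umap_kwargs({"n_neighbors": 30, "dens_lambda": 0.1}, dens_lambda=2.0))
```

In this sketch a user-supplied min_samples replaces the default while the other two HDBSCAN defaults are kept, and a dens_lambda inside the UMAP dictionary is overwritten by the class-level value.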

References

[1] https://github.com/lrcfmd/ElM2D

[2] https://github.com/lrcfmd/ElMD/tree/v0.4.7#elemental-similarity

[3] https://github.com/kaaiian/CBFV

cluster(umap_emb, min_cluster_size=50, min_samples=1)[source]

Cluster using HDBSCAN*.

Parameters
  • umap_emb (nD Array) – DensMAP embedding coordinates.

  • min_cluster_size (int, optional) – “The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.” (source: HDBSCAN* docs), by default 50

  • min_samples (int, optional) – “The number of samples in a neighbourhood for a point to be considered a core point.” (source: HDBSCAN* docs), by default 1

Returns

clusterer – HDBSCAN clusterer fitted to UMAP embeddings.

Return type

HDBSCAN class

cluster_count_hist()[source]

Histogram of cluster counts colored by cluster label.

compute_log_density(r_orig=None)[source]

Compute the log density based on the radii.

Parameters

r_orig (1d array, optional) – The original radii associated with the fitted DensMAP, by default None. If None, then defaults to self.std_r_orig.

Returns

self.dens, self.log_dens – Densities and log densities associated with the original radii, respectively.

Return type

1d array

Notes

Density is approximated as 1/r_orig
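
The approximation in the note can be written out as a short sketch. The function name log_density_sketch is hypothetical, and the code assumes strictly positive radii; how the library standardizes or otherwise preprocesses r_orig is not stated here and is not reproduced.

```python
import math


def log_density_sketch(r_orig):
    """Approximate density as 1 / r and return (densities, log densities)."""
    dens = [1.0 / r for r in r_orig]  # assumes every radius is > 0
    log_dens = [math.log(d) for d in dens]
    return dens, log_dens


dens, log_dens = log_density_sketch([0.5, 1.0, 2.0])
print(dens)  # [2.0, 1.0, 0.5]
```

Smaller radii give larger densities, so a point with radius 1.0 has a log density of exactly zero under this approximation.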

data(module, **data_kwargs)[source]

Grab data from within the subdirectories (modules) of mat_discover.

Parameters
  • module (Module) – The module within mat_discover that contains e.g. “train.csv”. For example, from crabnet.data.materials_data import elasticity

  • fname (str, optional) – Filename of text file to open.

  • dummy (bool, optional) – Whether to pare down the data to a small test set, by default False

  • groupby (bool, optional) – Whether to use groupby_formula to filter identical compositions

  • split (bool, optional) – Whether to split the data into train, val, and (optionally) test sets, by default True

  • val_size (float, optional) – Validation dataset fraction, by default 0.2

  • test_size (float, optional) – Test dataset fraction, by default 0.0

  • random_state (int, optional) – seed to use for the train/val/test split, by default 42

Returns

  • DataFrame – If split==False, then the full DataFrame is returned directly

  • DataFrame, DataFrame – If test_size == 0 and split==True, then training and validation DataFrames are returned.

  • DataFrame, DataFrame, DataFrame – If test_size > 0 and split==True, then training, validation, and test DataFrames are returned.
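
The three return shapes can be illustrated with a standard-library sketch. The function split_rows is hypothetical and operates on a plain list rather than a DataFrame; the shuffling and rounding choices are assumptions and may differ from what data() does internally.

```python
import random


def split_rows(rows, split=True, val_size=0.2, test_size=0.0, random_state=42):
    """Mimic the documented return shapes of data() on a plain list."""
    rows = list(rows)
    if not split:
        return rows  # full dataset, returned directly
    shuffled = rows[:]
    random.Random(random_state).shuffle(shuffled)  # seeded, reproducible
    n_val = round(len(shuffled) * val_size)
    n_test = round(len(shuffled) * test_size)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    if test_size == 0:
        return train, val  # two outputs when no test set is requested
    return train, val, test  # three outputs otherwise


train, val = split_rows(range(10))
print(len(train), len(val))  # 8 2
train, val, test = split_rows(range(10), test_size=0.1)
print(len(train), len(val), len(test))  # 7 2 1
```

Callers therefore need to unpack one, two, or three values depending on split and test_size.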

dens_scatter()[source]

Density scatter plot, with densities computed via probability density fn.

dens_targ_scatter()[source]

Target value scatter plot (colored by target value) overlay on densities.

extract_emb_rad(trans)[source]

Extract densMAP embedding and radii.

Parameters

trans (class) – A fitted UMAP class.

Returns

  • emb – UMAP embedding

  • r_orig – original radii

  • r_emb – embedded radii

See also

umap.UMAP

UMAP class.

extract_labels_probs(clusterer)[source]

Extract cluster IDs (labels) and probabilities from HDBSCAN* clusterer.

Parameters

clusterer (HDBSCAN class) – Instantiated HDBSCAN* class for clustering.

Returns

  • labels_ (ndarray, shape (n_samples, )) – “Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.” (source: HDBSCAN* docs)

  • probabilities_ (ndarray, shape (n_samples, )) – “The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.” (source: HDBSCAN* docs)

fit(train_df)[source]

Fit CrabNet model to training data.

Parameters

train_df (DataFrame) – Should contain (“formula” or “structure”) and “target” columns.
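
When a custom regressor is passed to Discover instead of the default, it needs fit(dataframe) and predict(dataframe) methods. A minimal sketch of such a class follows. MeanRegressor is a hypothetical baseline, and returning a plain list from predict is an assumption, since the expected return type is not stated here. The example uses dictionaries of lists so that it runs without pandas; column access works the same way on a DataFrame.

```python
from statistics import fmean


class MeanRegressor:
    """Hypothetical baseline: predicts the training-set mean for every row."""

    def fit(self, train_df):
        # Only the documented "target" column is needed for this baseline.
        self.mean_ = fmean(train_df["target"])
        return self

    def predict(self, val_df):
        # One prediction per row; "formula" follows the documented input format.
        return [self.mean_] * len(val_df["formula"])


train = {"formula": ["Al2O3", "SiO2", "MgO"], "target": [1.0, 2.0, 6.0]}
val = {"formula": ["TiO2", "ZnO"]}
model = MeanRegressor().fit(train)
print(model.predict(val))  # [3.0, 3.0]
```

For structure-based inputs the same pattern would read the “structure” column instead of “formula”.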

gcv_pareto()[source]

Cluster-wise group cross-validation parity plot.

group_cross_val(df, umap_random_state=None, dummy_run=None)[source]

Perform leave-one-cluster-out cross-validation (LOCO-CV).

Parameters
  • df (DataFrame) – Contains “formula” and “target” (all the data)

  • umap_random_state (int, optional) – Random state to use for DensMAP embedding, by default None

  • dummy_run (bool, optional) – Whether to perform a “dummy run” (i.e. use multi-dimensional scaling which is faster), by default None

Returns

Scaled, weighted error based on Wasserstein distance (i.e. a sorting distance).

Return type

float

Raises

ValueError – Needs to have at least one cluster. It is assumed that there will always be a non-cluster (i.e. unclassified points) if there is only 1 cluster.

Notes

TODO: highest mean vs. highest single target value
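
The splitting step of leave-one-cluster-out cross-validation can be sketched as follows. The generator leave_one_cluster_out is hypothetical, and it assumes that the HDBSCAN* noise label (-1) is treated as one more group to hold out; the library may handle unclassified points differently.

```python
def leave_one_cluster_out(labels):
    """Yield (held_out_label, train_index, val_index) for each cluster label."""
    groups = sorted(set(labels))
    if len(groups) < 2:
        # With a single group there is nothing left to train on.
        raise ValueError("Need at least two groups to hold one out.")
    for held_out in groups:
        val_index = [i for i, lab in enumerate(labels) if lab == held_out]
        train_index = [i for i, lab in enumerate(labels) if lab != held_out]
        yield held_out, train_index, val_index


labels = [0, 0, 1, -1, 1, 0]
for held_out, train_index, val_index in leave_one_cluster_out(labels):
    print(held_out, train_index, val_index)
```

Each cluster is used exactly once as the validation set, so the regressor is always evaluated on compounds from a cluster it has not seen during training.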

load(fpath='disc.pkl')[source]

Load Discover model.

Parameters

fpath (str, optional) – Filepath from which to load, by default “disc.pkl”

Returns

Loaded Discover() model.

Return type

Class

merge(nscores=100)[source]

Perform an outer merge of the density and peak proxy rankings.

Returns

Outer merge of the two proxy rankings.

Return type

DataFrame

mvn_prob_sum(emb, r_orig, n=100)[source]

Gridded multivariate normal probability summation.

Parameters
  • emb (ndarray) – Clustering embedding.

  • r_orig (1d array) – Original DensMAP radii.

  • n (int, optional) – Number of points along the x and y axes (total grid points = n^2), by default 100

Returns

  • x (1d array) – x-coordinates

  • y (1d array) – y-coordinates

  • pdf_sum (1d array) – summed densities at the (x, y) locations
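
A gridded summation of this kind can be sketched with the standard library alone. The names isotropic_gaussian_pdf and gridded_pdf_sum, the pad argument, and the choice to treat each radius as the standard deviation of an isotropic Gaussian are all assumptions made for illustration; the library's own grid limits and radius handling may differ.

```python
import math


def isotropic_gaussian_pdf(x, y, mu_x, mu_y, r):
    """2D normal density with mean (mu_x, mu_y) and covariance r**2 * I."""
    sq_dist = (x - mu_x) ** 2 + (y - mu_y) ** 2
    return math.exp(-sq_dist / (2 * r**2)) / (2 * math.pi * r**2)


def gridded_pdf_sum(emb, radii, n=100, pad=1.0):
    """Sum one Gaussian per embedded point over an n-by-n grid (n >= 2)."""
    x_lo = min(p[0] for p in emb) - pad
    x_hi = max(p[0] for p in emb) + pad
    y_lo = min(p[1] for p in emb) - pad
    y_hi = max(p[1] for p in emb) + pad
    grid_x = [x_lo + (x_hi - x_lo) * i / (n - 1) for i in range(n)]
    grid_y = [y_lo + (y_hi - y_lo) * j / (n - 1) for j in range(n)]
    x, y, pdf_sum = [], [], []
    for gy in grid_y:
        for gx in grid_x:
            x.append(gx)
            y.append(gy)
            pdf_sum.append(
                sum(
                    isotropic_gaussian_pdf(gx, gy, px, py, r)
                    for (px, py), r in zip(emb, radii)
                )
            )
    return x, y, pdf_sum  # flattened lists of length n * n


x, y, pdf_sum = gridded_pdf_sum([(0.0, 0.0), (3.0, 0.0)], [1.0, 0.5], n=50)
print(len(pdf_sum))  # 2500
```

Points with small radii contribute narrow, tall peaks, so regions where many such points overlap end up with the highest summed density.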

pf_dens_proxy()[source]

True targets vs. dens proxy pareto plot (both training and validation).

pf_frac_proxy()[source]

Cluster-wise average vs. cluster-wise validation fraction Pareto plot.

In other words, the average performance of a cluster vs. cluster novelty.

pf_peak_proxy()[source]

Predicted target vs. peak proxy pareto plot.

Peak proxy gives an idea of how “surprising” the performance is (i.e. a local peak in the ElMD space).

pf_train_contrib_proxy()[source]

Predicted target vs train contribution to validation log density pareto plot.

This is only for the validation data. Training data contribution to validation log density is a proxy for chemical novelty (i.e. how novel is a given validation datapoint relative to the training data).

plot(return_pareto_ind: bool = False)[source]

Plot and save various cluster and Pareto front figures.

Parameters

return_pareto_ind (bool, optional) – Whether to return the pareto front indices, by default False

Returns

pk_pareto_ind, dens_pareto_ind – Pareto front indices for the peak and density proxies, respectively.

Return type

tuple of int

predict(val_df, plotting: Optional[bool] = None, umap_random_state=None, pred_weight=None, proxy_weight=None, dummy_run: Optional[bool] = None, count_repeats: bool = False, return_peak: bool = False)[source]

Predict target and proxy for validation dataset.

Parameters
  • val_df (DataFrame) – Validation dataset containing at minimum (“formula” or “structure”) and optionally “target” (targets are populated with 0’s if not available).

  • plotting (bool, optional) – Whether to plot, by default None

  • umap_random_state (int or None, optional) – The random seed to use for UMAP, by default None

  • pred_weight (int, optional) – The weight to assign to the scaled target predictions (proxy_weight = 1 by default), by default None. If neither pred_weight nor self.pred_weight is specified, it defaults to 1.

  • proxy_weight (int, optional) – The weight to assign to the scaled proxy predictions (pred_weight is 1 by default), by default None. When specified, proxy_weight takes precedence over self.proxy_weight. If neither proxy_weight nor self.proxy_weight is specified, it defaults to 1.

  • dummy_run (bool, optional) – Whether to use MDS in place of the (typically more expensive) DensMAP, by default None. If neither dummy_run nor self.dummy_run is specified, it defaults to (effectively) being False. When specified, dummy_run takes precedence over self.dummy_run.

  • count_repeats (bool, optional) – Whether repeat chemical formulae should intensify the local density (i.e. decrease the novelty) or not. By default False.

  • return_peak (bool, optional) – Whether or not to return the peak scores in addition to the density-based scores. By default, False.

Returns

Scaled discovery scores for density and peak proxies. Returns only dens_score if return_peak is False, which is the default.

Return type

dens_score, peak_score

px_targ_scatter()[source]

Interactive targ_scatter plot.

px_umap_cluster_scatter()[source]

Interactive scatter plot of DensMAP embeddings colored by clusters.

save(fpath='disc.pkl', dummy=False)[source]

Save Discover model.

Parameters

fpath (str, optional) – Filepath to which to save, by default “disc.pkl”

See also

load

load a Discover model.

single_group_cross_val(X, y, train_index, val_index, iter)[source]

Perform leave-one-cluster-out cross-validation.

Parameters
  • X (DataFrame of str or Structure objects) – Chemical formulae or pymatgen Structure objects.

  • y (Series of float) – Target properties.

  • train_index (1d array of int) – Training indices for a given split.

  • val_index (1d array of int) – Validation indices for a given split.

  • iter (int) – Iteration (i.e. how many clusters have been processed so far).

Returns

true_avg_targ, pred_avg_targ, train_avg_targ – True, predicted, and training average targets for each of the clusters. The average training target is used to create a “dummy” measure of performance (i.e. one of the simplest “models” you can use, the average of the training data).

Return type

1d array of float

sort(score, proxy_name='density')[source]

Sort (rank) compounds by their proxy score.

Parameters
  • score (1D Array) – Discovery scores for the given proxy given by proxy_name.

  • proxy_name (string, optional) – Name of the proxy, by default “density”. Possible values are “density” (self.val_dens), “peak” (self.k_neigh_avg), and “radius” (self.val_rad_neigh_avg).

Returns

Contains (“formula” or “structure”), “prediction”, proxy_name, and “score”.

Return type

DataFrame

target_scatter()[source]

Scatter plot of DensMAP embeddings colored by target values.

umap_cluster_scatter()[source]

Static scatter plot colored by clusters.

umap_fit_cluster(dm, metric='precomputed', random_state=None)[source]

Perform DensMAP fitting for clustering.

See https://umap-learn.readthedocs.io/en/latest/clustering.html.

Parameters
  • dm (ndarray) – Pairwise Element Mover’s Distance (ElMD) matrix within a single set of points.

  • metric (str) – Which metric to use for DensMAP, by default “precomputed”.

  • random_state (int, RandomState instance or None, optional (default: None)) – “If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.” (source: UMAP docs)

Returns

umap_trans – A UMAP class fitted to dm.

Return type

UMAP class

See also

umap.UMAP

UMAP class.

umap_fit_vis(dm, random_state=None)[source]

Perform DensMAP fitting for visualization.

See https://umap-learn.readthedocs.io/en/latest/clustering.html.

Parameters
  • dm (ndarray) – Pairwise Element Mover’s Distance (ElMD) matrix within a single set of points.

  • random_state (int, RandomState instance or None, optional (default: None)) – “If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.” (source: UMAP docs)

Returns

std_trans – A UMAP class fitted to dm.

Return type

UMAP class

See also

umap.UMAP

UMAP class.

weighted_score(pred, proxy, pred_weight=None, proxy_weight=None, pred_scaler=None, proxy_scaler=None)[source]

Calculate weighted discovery score using the predicted target and proxy.

Parameters
  • pred (1D Array) – Predicted target property values.

  • proxy (1D Array) – Predicted proxy values (e.g. density or peak proxies).

  • pred_weight (int, optional) – The weight to assign to the scaled predictions, by default 1

  • proxy_weight (int, optional) – The weight to assign to the scaled proxies, by default 1

Returns

Discovery scores.

Return type

1D array
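
The scale-then-sum idea behind the discovery score can be sketched without scikit-learn. The functions robust_scale and weighted_score_sketch are hypothetical; the scaling imitates a median and interquartile-range scaler in the spirit of the default RobustScaler, and the plain weighted sum is an assumption based on the Scaler parameter description rather than a copy of the library's implementation.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # fall back to 1.0 for constant input
    med = median(values)
    return [(v - med) / iqr for v in values]


def weighted_score_sketch(pred, proxy, pred_weight=1.0, proxy_weight=1.0):
    """Scale predictions and proxies separately, then take a weighted sum."""
    pred_scaled = robust_scale(pred)
    proxy_scaled = robust_scale(proxy)
    return [
        pred_weight * p + proxy_weight * q
        for p, q in zip(pred_scaled, proxy_scaled)
    ]


print(weighted_score_sketch([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], pred_weight=2))
# [-1.0, -0.5, 0.0, 0.5, 1.0]
```

With pred_weight = 2 the ranking follows the predictions even though the proxy values run in the opposite direction; with equal weights the two would cancel exactly in this toy example.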

class mat_discover.mat_discover_.M3GNetWrapper(epochs=1000)[source]

Bases: object

__init__(epochs=1000)[source]
fit(train_df)[source]
predict(val_df)[source]
class mat_discover.mat_discover_.MEGNetWrapper(epochs=1000, r_cutoff=10, nfeat_bond=100)[source]

Bases: object

__init__(epochs=1000, r_cutoff=10, nfeat_bond=100)[source]
fit(train_df)[source]
predict(val_df)[source]
mat_discover.mat_discover_.cdf_sorting_error(y_true, y_pred, y_dummy=None)[source]

Cumulative distribution function sorting error via Wasserstein distance.

Parameters
  • y_true (list of (float or int or str)) – True values to use for the sorting error.

  • y_pred (list of (float or int or str)) – Predicted values to use for the sorting error.

  • y_dummy (list of (float or int or str), optional) – Dummy values to use to generate a scaled error, by default None

Returns

error, dummy_error, scaled_error – The unscaled, dummy, and scaled errors that describe the mismatch in sorting between the CDFs of two lists. The scaled error represents the improvement relative to the dummy error, such that scaled_error = error / dummy_error. If scaled_error > 1, the sorting error is worse than if you took the average of the y_true values as the y_pred values. If scaled_error < 1, it is better than this “dummy” regressor. Scaled errors closer to 0 are better.

Return type

float
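
The relationship between the three returned errors can be sketched as follows. The functions wasserstein_1d and cdf_sorting_error_sketch are hypothetical stand-ins that only handle numeric lists of equal length, where the one-dimensional Wasserstein distance reduces to the mean absolute difference of the sorted values. Using the mean of y_true as the default dummy prediction follows the description above; any other preprocessing the library applies is not reproduced.

```python
from statistics import fmean


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-length, equally weighted samples."""
    if len(a) != len(b):
        raise ValueError("This sketch only handles samples of equal length.")
    return fmean([abs(x - y) for x, y in zip(sorted(a), sorted(b))])


def cdf_sorting_error_sketch(y_true, y_pred, y_dummy=None):
    """Return (error, dummy_error, scaled_error)."""
    if y_dummy is None:
        # "Dummy" regressor: predict the average of the true values everywhere.
        y_dummy = [fmean(y_true)] * len(y_true)
    error = wasserstein_1d(y_true, y_pred)
    dummy_error = wasserstein_1d(y_true, y_dummy)
    return error, dummy_error, error / dummy_error


print(cdf_sorting_error_sketch([1, 2, 3, 4], [1, 2, 3, 5]))
# (0.25, 1.0, 0.25)
```

A scaled error of 0.25 means the predictions are four times closer to the true distribution than the constant-average baseline.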

mat_discover.mat_discover_.my_mvn(mu_x, mu_y, r)[source]

Calculate multivariate normal at (mu_x, mu_y) with constant radius, r.

mat_discover.adaptive_design module

class mat_discover.adaptive_design.Adapt(train_df, val_df, **Discover_kwargs)[source]

Bases: Discover

__init__(train_df, val_df, **Discover_kwargs)[source]

Initialize a Discover() class.

Parameters
  • timed (bool, optional) – Whether or not timing is reported, by default True

  • dens_lambda (float, optional) – “Controls the regularization weight of the density correlation term in densMAP. Higher values prioritize density preservation over the UMAP objective, and vice versa for values closer to zero. Setting this parameter to zero is equivalent to running the original UMAP algorithm.” Source: https://umap-learn.readthedocs.io/en/latest/api.html, by default 1.0

  • plotting (bool, optional) – Whether to create and save various compound-wise and cluster-wise figures, by default False

  • pdf (bool, optional) – Whether or not probability density function values are computed, by default True

  • n_peak_neighbors (int, optional) – Number of neighbors to consider when computing k_neigh_avg (i.e. peak proxy), by default 10

  • verbose (bool, optional) – Whether to print verbose information, by default True

  • dummy_run (bool, optional) – Whether to use MDS instead of UMAP to run quickly for small datasets. Note that MDS takes longer than UMAP for large datasets, by default False

  • Scaler (str or class, optional) – Scaler to use for weighted_score (i.e. weighted score of target and proxy values). Target and proxy are separately scaled using Scaler before taking the weighted sum. Possible values are “MinMaxScaler”, “StandardScaler”, “RobustScaler”, or an sklearn.preprocessing scaler class, by default RobustScaler.

  • figure_dir (str, optional) – Relative or absolute path to the directory in which to save figures, by default “figures”. The directory will be created if it does not already exist. If dummy_run is True, “dummy” is appended to the path via os.path.join.

  • table_dir (str, optional) – Relative or absolute path to the directory in which to save tables, by default “tables”. The directory will be created if it does not already exist. If dummy_run is True, “dummy” is appended to the path via os.path.join.

  • target_unit (Optional[str]) – Unit of target to use in various x, y, and color axes labels. If None, don’t add a unit to the labels. By default None.

  • use_plotly_offline (bool) – Whether to use offline.plot(fig) instead of fig.show(). Set to False for Google Colab. By default, True.

  • pred_weight (float, optional) – Weighting applied to the predicted, scaled target values, by default 1 (i.e. equal weighting between predictions and proxies). For example, to weight the predicted targets at twice that of the proxy values, set to 2 (while keeping the default of proxy_weight = 1).

  • novelty_learner (str or sklearn Regressor, optional) – Whether to use the DiSCoVeR algorithm (“discover”) or another learner for novelty detection (e.g. sklearn.neighbors.LocalOutlierFactor). By default “discover”.

  • novelty_prop (str, optional) – Which featurization scheme to use for determining novelty. “mod_petti” is currently the only supported/tested option for the DiSCoVeR novelty_learner for speed considerations, though the other “linear” featurizers should technically be compatible (untested). The “vector” featurizers can be implemented, although with some code plumbing needed. See ElM2D [1] and ElMD supported featurizers [2]. Possible options for sklearn-type novelty learners are those supported by the CBFV [3] package (assuming that all elements that appear in the train/val datasets are supported). By default “mod_petti”.

  • proxy_weight (float, optional) – Weighting applied to the predicted, scaled proxy values, by default 1 (i.e. equal weighting between predictions and proxies when using default pred_weight = 1). For example, to weight the proxy values at twice that of the predicted, scaled targets, set to 2 while retaining pred_weight = 1.

  • nscores (int, optional) – Number of scores (i.e. compounds) to return in the CSV output files.

  • regressor (instantiated class, optional) –

    The regressor to use for predicting target values, e.g., CrabNet(), or CrabNet(epochs=40) (decreasing the number of epochs may be useful for smaller datasets). See CrabNet() API. Can be another instantiated class, which at minimum contains fit(dataframe) and predict(dataframe) methods, where dataframe is a pandas DataFrame with at minimum columns (“formula” or “structure”) and “target”. If None, then defaults to CrabNet(). By default None.

  • use_structure (bool, optional) – Whether to use structure-based featurization instead of formula-based. If use_structure is False and regressor is None and mapper is None, then CrabNet and ElMD are used as the regressor and mapper, respectively. If use_structure is True and regressor is None and mapper is None, then M3GNet and GridRDF are used as the regressor and mapper, respectively. By default False.

  • umap_cluster_kwargs (dict, optional) –

    umap.UMAP kwargs that are passed directly into the UMAP embedder that is used for clustering. By default None. See basic UMAP parameters and the UMAP API. If this contains a dens_lambda key, the value of the Discover class kwarg will take precedence.

  • umap_vis_kwargs (dict, optional) –

    umap.UMAP kwargs that are passed directly into the UMAP embedder that is used for visualization. By default None. See basic UMAP parameters and the UMAP API. If this contains a dens_lambda key, the value of the Discover class kwarg will take precedence.

  • hdbscan_kwargs (dict, optional) –

    hdbscan.HDBSCAN kwargs that are passed directly into the HDBSCAN clusterer. By default, None. See Parameter Selection for HDBSCAN* and the HDBSCAN API. If min_cluster_size is not specified, defaults to 50. If min_samples is not specified, defaults to 1. If cluster_selection_epsilon is not specified, defaults to 0.63.

References

[1] https://github.com/lrcfmd/ElM2D

[2] https://github.com/lrcfmd/ElMD/tree/v0.4.7#elemental-similarity

[3] https://github.com/kaaiian/CBFV

closed_loop_adaptive_design(n_experiments=900, extraordinary_thresh=None, extraordinary_quantile=0.98, **suggest_next_experiment_kwargs)[source]
suggest_first_experiment(proxy_name='density', random_search=False, fit=True, print_experiment=True, **predict_kwargs)[source]
suggest_next_experiment(proxy_name='density', fit=True, predict=False, random_search=False, print_experiment=True, **predict_kwargs)[source]
class mat_discover.adaptive_design.DummyCrabNet[source]

Bases: object

__init__()[source]
fit(train_df)[source]
predict(val_df)[source]
mat_discover.adaptive_design.ad_experiments_metrics(experiments, train_df, extraordinary_thresh)[source]
mat_discover.adaptive_design.ad_metrics(experiments, init_train_df, extraordinary_thresh)[source]

Subpackages