API

class schema.SchemaQP(min_desired_corr=0.99, mode='affine', params={})[source]

Bases: object

Schema is a tool for integrating simultaneously-assayed data modalities

The SchemaQP class provides a sklearn-style fit+transform API for constrained affine transformations of input datasets, such that the transformed data is in agreement with all the input datasets.

Parameters:
  • min_desired_corr (float in [0,1)) –

    This parameter controls the severity of the primary modality’s transformation, specifying the minimum required correlation between distances in the original space and those in the transformed space. It thus controls the trade-off between deviating further away from the primary modality’s original representation and achieving greater agreement with the secondary modalities. Values close to one result in lower distortion of the primary modality while those close to zero enable transformations offering greater agreement between the modalities.

    RECOMMENDED VALUES: In typical single-cell use cases, high values (> 0.80) will probably work best. With these, the distortion of the primary modality will be low, yet still enough for Schema to extract relevant information from the secondary modalities. Furthermore, the feature weights computed by Schema should still be quite informative.

    The default value of 0.99 is a safe choice to start with; it poses low risk of deviating too far from the primary modality.

    Later, you can experiment with a range of values (e.g., 0.95, 0.90, 0.80), or use feature weights aggregated across an ensemble of choices. Alternatively, you can use cross-validation to identify the best setting.

  • mode (string) –

    Whether to perform a general affine transformation or just a scaling transformation

    • affine first does a mapping to PCA or NMF space (you can specify num_top_components via the params argument). Schema does a scaling transform in the mapped space and then converts everything back to the regular space. The final result is thus an affine transformation in the regular space.
    • scale does not do a PCA or NMF mapping, and directly applies the scaling transformation. Note: This can be slow if the primary modality’s dimensionality is over 100.

    RECOMMENDED VALUES: affine is the default. You may need scale only in certain cases:

    • You have a limited number of features on which you directly want Schema to compute feature-weights.
    • You want to do a change-of-basis transform other than PCA or NMF. If so, you will need to do that yourself and then call SchemaQP with the transformed primary dataset, using mode=’scale’.
  • params (dict) –

    Dictionary of key-value pairs, specifying additional configuration parameters. Here are the important ones:

    • decomposition_model: “pca” or “nmf” (default=pca)
    • num_top_components: (default=50) number of PCA (or NMF) components to use when mode==”affine”. We recommend this setting be <= 100. Schema’s runtime is quadratic in this number.

    You can ignore the rest on your first pass; the default values are pretty reasonable:

    • dist_npairs: (default=2000000). How many point-pairs to use for computing pairwise distances. value=None means compute exhaustively over all n*(n-1)/2 point-pairs, which is not recommended for n > 5000. Otherwise, the given number of point-pairs is sampled randomly; the sampling is done so that each point is represented roughly equally.
    • scale_mode_uses_standard_scaler: 1 or 0 (default=0). Whether to apply the standard scaler (zero mean, unit variance per feature) when mode==”scale”.
    • do_whiten: 1 or 0 (default=1). When mode==”affine”, should the change-of-basis loadings be scaled to unit variance (i.e., whitened)?
Returns:

A SchemaQP object on which you can call fit(…), transform(…), or fit_transform(…).
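
For illustration, here is a minimal sketch of constructing a SchemaQP object on synthetic data; the random data and variable names below are purely illustrative and not part of the API (it assumes the package is importable as schema, per the class path above):

    import numpy as np
    import schema

    # Synthetic primary modality: 1000 observations (cells) x 200 features (genes).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 200))

    # Affine mode with low distortion of the primary modality.
    sqp = schema.SchemaQP(min_desired_corr=0.99,
                          mode='affine',
                          params={'decomposition_model': 'pca',
                                  'num_top_components': 50})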

explore_param_mincorr(d, secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list=None, min_desired_corr_values=[0.999, 0.99, 0.95, 0.9, 0.8, 0.5, 0.2, 0], addl_fit_kwargs={}, addl_feature_weights_kwargs={})[source]

Helper function to explore multiple choices of the min_desired_corr param.

For a range of min_desired_corr parameter values, it performs a fit, gets the feature_weights, and also the achieved distance correlation between the transformed data and the primary/secondary modalities. While this method is simply a convenience wrapper around other public methods, it is nonetheless useful for exploring the best choice of min_desired_corr for your application. For example, if you’re doing batch correction and hence set a secondary modality’s wt to be negative, you want the distance correlation of batch information and transformed data to go to zero, not beyond that into negative correlation territory. This function can help you identify an appropriate min_desired_corr value.

The required arguments are the same as those for a call to fit (which this method calls, under the hood). The default list of possible values for min_desired_corr is a good place to start.

Parameters:
  • d (Numpy 2-d array or Pandas dataframe) – Same as in fit: the primary dataset (e.g. scanpy/anndata’s .X).
  • secondary_data_val_list (list of 1-d or 2-d Numpy arrays or Pandas series, each with same number of rows as d) – Same as in fit: the secondary datasets you want to align the primary data towards.
  • secondary_data_type_list (list of strings) – Same as in fit: the datatypes of the secondary modalities.
  • secondary_data_wt_list (list of floats, optional) – Same as in fit: user-specified wts for each secondary dataset (default= list of 1’s)
  • min_desired_corr_values (list of floats, each value v being 0 <= v < 1) – list of min_desired_corr values to explore. The default is [0.999, 0.99, 0.95, 0.9, 0.8, 0.5, 0.2, 0]
  • addl_fit_kwargs (dict) – additional named arguments passed on to fit(…)
  • addl_feature_weights_kwargs (dict) – named arguments passed on to feature_weights(…)
Returns:

a tuple with 4 entries. In the first 3 below, each row of the dataframe corresponds to a min_desired_corr value

  1. Dataframe of starting and ending distance correlations (see get_start_end_dist_correlations for details)
  2. Dataframe of feature weights, produced by a call to feature_weights
  3. Dataframe of QP solution wts. Same as feature weights if mode=’scale’, otherwise this corresponds to the QP-computed wts in the PCA/NMF space
  4. Dictionary of SchemaQP objects, keyed by the min_desired_corr parameter. You can use them for transform calls.
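
Continuing the sketch above, here is an illustrative call for a batch-correction style use case, where a categorical batch label is given a negative weight; batch_labels is a made-up stand-in for your own annotation:

    # Hypothetical batch assignment, one label per synthetic cell above.
    batch_labels = rng.integers(0, 2, size=X.shape[0])

    dist_corr_df, feat_wt_df, qp_wt_df, fitted_models = sqp.explore_param_mincorr(
        X,
        [batch_labels],
        ['categorical'],
        secondary_data_wt_list=[-1],                      # negative wt => disagree with batch
        min_desired_corr_values=[0.99, 0.95, 0.9, 0.8])

    # Inspect dist_corr_df to find the largest min_desired_corr at which the
    # distance correlation with batch labels is close to zero, then reuse the
    # corresponding entry of fitted_models for transform(...) calls.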

feature_weights(affine_map_style='top-k-loading', k=1)[source]

Return the feature weights computed by Schema

If SchemaQP was initialized with mode=scale, the weights returned are directly the weights from the quadratic programming (QP), with a weight > 1 indicating the feature was up-weighted. The affine_map_style argument is ignored.

However, if mode=affine was used, the QP-computed weights correspond to columns of the PCA or NMF decomposition. In that case, this function maps them back to the primary dataset’s features. This can be done in three different ways, as specified by the affine_map_style parameter.

You can build your own mapping from PCA/NMF weights to primary-modality feature weights. The instance’s _wts member is the numpy array that contains QP-computed weights, and _decomp_mdl is the sklearn-computed NMF/PCA decomposition. You can also look at the source code of this function to get a sense of how to use them.

Parameters:
  • affine_map_style (string, one of 'softmax-avg' or 'top-k-rank' or 'top-k-loading', default='top-k-loading') –

    Governs how QP-computed weights for PCA/NMF columns are mapped back to primary-modality features (typically, genes from a scRNA-seq dataset).

    Default is ‘top-k-loading’, which considers only the top-k PCA/NMF columns by QP-computed weight and computes the average loading of a gene across these. The second argument specifies k (default=1)

    Another choice is ‘softmax-avg’, which computes gene weights by a softmax-type summation of loadings across the PCA/NMF columns, with each column’s weight proportional to exp(QP wt), and only columns with QP weight > 1 being considered. k is ignored here.

    Yet another choice is ‘top-k-rank’, which considers only the top-k PCA/NMF columns by QP-computed weight and computes the average rank of a gene across their loadings. The second argument specifies k (default=1)

    In all approaches, PCA loadings are first converted to absolute value, since PCA columns are unique up to a sign.

  • k (int, >= 0) – The number of PCA/NMF columns to average over, when affine_map_style = top-k-loading or top-k-rank.

Returns:

A vector of floats, the same size as the primary dataset’s dimensionality (one weight per feature).
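
As an illustrative sketch, assuming sqp has already been fit in ’affine’ mode (see fit below) and that gene_names is a hypothetical list of the primary dataset’s column names:

    import pandas as pd

    wts = sqp.feature_weights(affine_map_style='top-k-loading', k=3)

    # Rank genes by their Schema weight; gene_names is an assumed list of column names.
    top_genes = pd.Series(wts, index=gene_names).sort_values(ascending=False).head(20)
    print(top_genes)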

fit(d, secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list=None, secondary_data_dist_kernels=None, d0=None, d0_dist_transform=None)[source]

Compute the optimal Schema transformation, first performing a change-of-basis transformation if required.

Given the primary dataset d and a list of secondary datasets, fit a linear transformation of d (call the result d_new) such that the correlation between squared pairwise distances in d_new and those in the secondary datasets is maximized, while the correlation between distances in the original d and those in d_new remains above min_desired_corr.

The first three arguments are required, the next is useful, and the rest should be rarely used.

Parameters:
  • d (Numpy 2-d array or Pandas dataframe) –

    The primary dataset (e.g. scanpy/anndata’s .X).

    The rows are observations (e.g., cells) and the cols are variables (e.g., gene expression). The default distance measure computed is L2: sum((point1-point2)**2). Also see d0_dist_transform.

  • secondary_data_val_list (list of 1-d or 2-d Numpy arrays or Pandas series, each with same number of rows as d) –

    The secondary datasets you want to align the primary data towards.

    Columns in Anndata .obs or .obsm variables work well.

  • secondary_data_type_list (list of strings) –

    The datatypes of the secondary modalities.

    Each element of the list can be one of numeric, feature_vector, categorical, feature_vector_categorical. The list’s length should match the length of secondary_data_val_list

    • numeric: one floating-point value for each observation. The default distance measure is the squared difference: (point1-point2)^2
    • feature_vector: a k-dimensional vector for each observation. The default distance measure is squared Euclidean: sum_{i}((point1[i]-point2[i])^2)
    • categorical: a label for each observation. The default distance measure is the mismatch indicator: 1*(val1!=val2)
    • feature_vector_categorical: a vector of labels for each observation. Each column can take on categorical values, so the default distance between two points is the number of mismatched entries: sum_{i}(point1[i]!=point2[i])
  • secondary_data_wt_list (list of floats, optional) –

    User-specified wts for each secondary dataset (default= list of 1’s)

    If specified, the list’s length should match the length of secondary_data_val_list. When multiple secondary modalities are specified, this parameter allows you to control their relative weight in seeking an agreement with the primary.

    Note: you can try to get a mapping that disagrees with a secondary dataset instead of agreeing with it. To do so, pass in a negative number (e.g., -1) here. This works even if you have just one secondary dataset.

  • secondary_data_dist_kernels (list of functions, optional) –

    The transformations to apply to each secondary dataset’s L2 distances before using them for correlations.

    If specified, the length of the list should match that of secondary_data_val_list. Each function should take a non-negative float and return a non-negative float.

    Handle with care: Most likely, you don’t need this parameter.

  • d0 (A 1-d or 2-d Numpy array, optional) –

    An alternative representation of the primary dataset.

    This is useful if you want to provide the primary dataset in two forms: one for transforming and another one for computing pairwise distances to use in the QP constraint; if so, d is used for the former, while d0 is used for the latter. If specified, it should have the same number of rows as d.

    Handle with care: Most likely, you don’t need this parameter.

  • d0_dist_transform (float -> float function, optional) –

    The transformation to apply to the L2 distances of d (or d0, if given) before using them for correlations.

    This function should take a non-negative float as input and return a non-negative float.

    Handle with care: Most likely, you don’t need this parameter.

Returns:

None
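
A minimal sketch of a typical fit(…) call, continuing the synthetic example above; cluster_labels and pseudotime are made-up secondary modalities:

    # Two synthetic secondary modalities, one row per observation in X.
    cluster_labels = rng.integers(0, 4, size=X.shape[0])    # categorical annotation
    pseudotime = rng.random(size=X.shape[0])                 # numeric annotation

    sqp = schema.SchemaQP(min_desired_corr=0.9)
    sqp.fit(X,
            [cluster_labels, pseudotime],
            ['categorical', 'numeric'],
            secondary_data_wt_list=[1.0, 0.5])   # emphasize clusters over pseudotime
    X_new = sqp.transform(X)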

fit_transform(d, secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list=None, secondary_data_dist_kernels=None, d0=None, d0_dist_transform=None)[source]

Calls fit(…) with exactly the arguments given, then calls transform(d). See the documentation of fit(…) and transform(…), respectively.
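
Equivalently, as a sketch under the same assumptions as the fit(…) example above, the two calls can be collapsed into one:

    X_new = sqp.fit_transform(X,
                              [cluster_labels, pseudotime],
                              ['categorical', 'numeric'],
                              secondary_data_wt_list=[1.0, 0.5])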

get_start_end_dist_correlations()[source]

Return the starting and ending distance correlations between primary and secondary modalities

Note: the distance correlations reported out (even between the primary and secondary modalities) may vary from run to run, since the underlying algorithm samples a set of point pairs to compute its estimates.

Returns:a tuple with 3 entries:

  1. distance correlation between primary and transformed space. This should always be >= min_desired_corr, but it can be substantially greater than min_desired_corr if the optimal solution requires that.
  2. vector of distance correlations between primary and secondary modalities, and
  3. vector of distance correlations between transformed dataset and secondary modalities
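
An illustrative way to unpack the result, assuming sqp is a previously fit SchemaQP object:

    prim_corr, start_corrs, end_corrs = sqp.get_start_end_dist_correlations()
    print(prim_corr)    # primary vs. transformed space; should be >= min_desired_corr
    print(start_corrs)  # primary vs. each secondary modality (before transformation)
    print(end_corrs)    # transformed data vs. each secondary modality
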
reset_maxwt_param(w_max_to_avg)[source]

Reset the w_max_to_avg param

Parameters:w_max_to_avg (float) –

The upper bound on the ratio of the largest element of the Schema weight vector (w) to its average element. This parameter controls the allowed spread (‘deviation’) of the feature weights; making it large allows for more severe transformations.

Handle with care: We recommend keeping this parameter at its default value (1000); that keeps this constraint very loose and ensures that min_desired_corr remains the binding constraint. Later, as you get a better sense for the right min_desired_corr values for your data, you can experiment with this too. To really constrain this, set it in the (1-5] range, depending on how many features you have.

reset_mincorr_param(min_desired_corr)[source]

Reset the min_desired_corr.

Useful when you want to iterate over multiple choices of this parameter but want to re-use the computed PCA or NMF change-of-basis transform.

Parameters:min_desired_corr (float in [0,1)) – The new value of minimum required correlation between original and transformed distances
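
A sketch of reusing one object across several min_desired_corr values so the PCA/NMF change-of-basis is computed only once; X and cluster_labels are the illustrative data from the fit(…) example above:

    wts_by_corr = {}
    for mc in [0.99, 0.95, 0.90]:
        sqp.reset_mincorr_param(mc)
        sqp.fit(X, [cluster_labels], ['categorical'])
        wts_by_corr[mc] = sqp.feature_weights()
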
transform(d)[source]

Given a dataset d, apply the fitted transform to it

Parameters:d (Numpy 2-d array) –

The primary modality data on which to apply the transformation.

d must have the same number of columns as the dataset passed to fit(…). The rows are observations (e.g., cells) and the cols are variables (e.g., gene expression).

Returns:a 2-d Numpy array with the same shape as d
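
For example (a sketch, continuing the synthetic data above), the fitted transform can be applied to held-out observations that have the same columns as the data used in fit(…):

    # 250 held-out observations with the same 200 columns as the fitting data.
    X_heldout = rng.normal(size=(250, X.shape[1]))
    X_heldout_new = sqp.transform(X_heldout)
    assert X_heldout_new.shape == X_heldout.shape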