`diffprivlib.models`¶

Machine learning models with differential privacy

Classification models¶

Gaussian Naive Bayes¶

class diffprivlib.models.GaussianNB(*, epsilon=1.0, bounds=None, priors=None, var_smoothing=1e-09, accountant=None)[source]¶

Gaussian Naive Bayes (GaussianNB) with differential privacy

Inherits the sklearn.naive_bayes.GaussianNB class from Scikit Learn and adds noise to the learned means and variances to satisfy differential privacy. Adapted from the work presented in [VSB13].

Parameters

epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) for the model.
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
priors (array-like, shape (n_classes,)) – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
var_smoothing (float, default: 1e-9) – Portion of the largest variance of all features that is added to variances for calculation stability.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

class_prior_¶

Probability of each class.

Type: array, shape (n_classes,)

class_count_¶

Number of training samples observed in each class.

Type: array, shape (n_classes,)

theta_¶

Mean of each feature per class.

Type: array, shape (n_classes, n_features)

var_¶

Variance of each feature per class.

Type: array, shape (n_classes, n_features)

epsilon_¶

Absolute additive value to variances (unrelated to the epsilon parameter for differential privacy).

Type: float

References

VSB13: Vaidya, Jaideep, Basit Shafiq, Anirban Basu, and Yuan Hong. “Differentially private naive bayes classification.” In 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1, pp. 571-576. IEEE, 2013.

fit(X, y, sample_weight=None)[source]¶

Fit Gaussian Naive Bayes according to X, y.

Parameters

X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target values.
sample_weight (array-like of shape (n_samples,), default=None) –
Weights applied to individual samples (1. for unweighted).

New in version 0.17: Gaussian Naive Bayes supports fitting with sample_weight.

Returns

self – Returns the instance itself.

Return type

object

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

partial_fit(X, y, classes=None, sample_weight=None)[source]¶

Incremental fit on a batch of samples.

This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning.

This is especially useful when the whole dataset is too big to fit in memory at once.

This method has some performance and numerical stability overhead, hence it is better to call partial_fit on chunks of data that are as large as possible (as long as fitting in the memory budget) to hide the overhead.

Parameters

X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target values.
classes (array-like of shape (n_classes,), default=None) –
List of all the classes that can possibly appear in the y vector.

Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
sample_weight (array-like of shape (n_samples,), default=None) –
Weights applied to individual samples (1. for unweighted).

New in version 0.17.

Returns

self – Returns the instance itself.

Return type

object

predict(X)¶

Perform classification on an array of test vectors X.

Parameters: X (array-like of shape (n_samples, n_features)) – The input samples.
Returns: C – Predicted target values for X.
Return type: ndarray of shape (n_samples,)

predict_log_proba(X)¶

Return log-probability estimates for the test vector X.

Parameters: X (array-like of shape (n_samples, n_features)) – The input samples.
Returns: C – Returns the log-probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
Return type: array-like of shape (n_samples, n_classes)

predict_proba(X)¶

Return probability estimates for the test vector X.

Parameters: X (array-like of shape (n_samples, n_features)) – The input samples.
Returns: C – Returns the probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
Return type: array-like of shape (n_samples, n_classes)

score(X, y, sample_weight=None)¶

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

property sigma_¶: Variance of each feature per class.

Logistic Regression¶

class diffprivlib.models.LogisticRegression(*, epsilon=1.0, data_norm=None, tol=0.0001, C=1.0, fit_intercept=True, max_iter=100, verbose=0, warm_start=False, n_jobs=None, accountant=None, **unused_args)[source]¶

Logistic Regression (aka logit, MaxEnt) classifier with differential privacy.

This class implements regularised logistic regression using Scipy’s L-BFGS-B algorithm. \(\epsilon\)-Differential privacy is achieved relative to the maximum norm of the data, as determined by data_norm, by the Vector mechanism, which adds a Laplace-distributed random vector to the objective. Adapted from the work presented in [CMS11].

This class is a child of sklearn.linear_model.LogisticRegression, with amendments to allow for the implementation of differential privacy. Some parameters of Scikit Learn’s model have therefore had to be fixed, including:

The only permitted solver is ‘lbfgs’. Specifying the solver option will result in a warning.

Consequently, the only permitted penalty is ‘l2’. Specifying the penalty option will result in a warning.

In the multiclass case, only the one-vs-rest (OvR) scheme is permitted. Specifying the multi_class option will result in a warning.

Parameters

epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
data_norm (float, optional) –
The max l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy.

If not specified, the max norm is taken from the data when .fit() is first called, but will result in a PrivacyLeakWarning, as it reveals information about the data. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.
tol (float, default: 1e-4) – Tolerance for stopping criteria.
C (float, default: 1.0) – Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
fit_intercept (bool, default: True) – Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
max_iter (int, default: 100) – Maximum number of iterations taken for the solver to converge. For smaller epsilon (more noise), max_iter may need to be increased.
verbose (int, default: 0) – Set to any positive number for verbosity.
warm_start (bool, default: False) – When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
n_jobs (int, optional) – Number of CPU cores used when parallelising over classes. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
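
To make the meaning of data_norm concrete, the following NumPy sketch scales down any row whose l2 norm exceeds a chosen bound. The helper clip_rows_to_norm is hypothetical and is not part of diffprivlib; it only illustrates what "the max l2 norm of any row" refers to, under the assumption that the bound is chosen from domain knowledge.

```python
import numpy as np

def clip_rows_to_norm(X, data_norm):
    """Scale down any row whose l2 norm exceeds data_norm (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Rows already within the bound get a factor of 1; the floor on the
    # denominator avoids dividing by zero for all-zero rows.
    factors = np.minimum(1.0, data_norm / np.maximum(norms, 1e-12))
    return X * factors

X = np.array([[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]])
X_clipped = clip_rows_to_norm(X, data_norm=1.0)
row_norms = np.linalg.norm(X_clipped, axis=1)
```

Only the first row (norm 5) is rescaled; the other two already satisfy the bound and are left unchanged.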

classes_¶

A list of class labels known to the classifier.

Type: array, shape (n_classes, )

coef_¶

Coefficient of the features in the decision function.

coef_ is of shape (1, n_features) when the given problem is binary.

Type: array, shape (1, n_features) or (n_classes, n_features)

intercept_¶

Intercept (a.k.a. bias) added to the decision function.

If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary.

Type: array, shape (1,) or (n_classes,)

n_iter_¶

Actual number of iterations for all classes. If binary, it returns only 1 element.

Type: array, shape (n_classes,) or (1, )

Examples

>>> from sklearn.datasets import load_iris
>>> from diffprivlib.models import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(data_norm=12, epsilon=2).fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[7.35362932e-01, 2.16667422e-14, 2.64637068e-01],
       [9.08384378e-01, 3.47767052e-13, 9.16156215e-02]])
>>> clf.score(X, y)
0.5266666666666666

Random Forest¶

class diffprivlib.models.RandomForestClassifier(n_estimators=10, *, epsilon=1.0, cat_feature_threshold=10, n_jobs=1, verbose=0, accountant=None, max_depth=15, random_state=None, feature_domains=None, **unused_args)[source]¶

Random Forest Classifier with differential privacy.

This class implements Differentially Private Random Decision Forests using Smooth Sensitivity [1]. \(\epsilon\)-Differential privacy is achieved by constructing decision trees with a random splitting criterion and applying the Exponential Mechanism to produce a noisy label.
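
As a rough illustration of how an exponential mechanism can produce a noisy label, the sketch below selects a class with probability proportional to exp(epsilon * count / (2 * sensitivity)). It uses the textbook form with a fixed sensitivity of 1, which is an assumption made here for simplicity; it is not the smooth-sensitivity variant of [1] and is not taken from diffprivlib's implementation. The function name and the leaf counts are hypothetical.

```python
import numpy as np

def exponential_mechanism_probs(counts, epsilon, sensitivity=1.0):
    """Selection probabilities proportional to exp(epsilon * count / (2 * sensitivity))."""
    scores = epsilon * np.asarray(counts, dtype=float) / (2.0 * sensitivity)
    scores -= scores.max()  # shift for numerical stability; probabilities are unchanged
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(0)
leaf_counts = [12, 3, 1]  # illustrative class counts in a single leaf
probs = exponential_mechanism_probs(leaf_counts, epsilon=1.0)
noisy_label = rng.choice(len(leaf_counts), p=probs)
```

The majority class is the most likely output, but any class can be returned; smaller values of epsilon flatten the distribution.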

Parameters

n_estimators (int, default: 10) – The number of trees in the forest.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
cat_feature_threshold (int, default: 10) – Threshold value used to determine categorical features. For example, a value of 10 means any feature with 10 or fewer unique values will be treated as a categorical feature.
n_jobs (int, default: 1) – Number of CPU cores used when parallelising over classes. -1 means using all processors.
verbose (int, default: 0) – Set to any positive number for verbosity.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
max_depth (int, default: 15) – The maximum depth of the tree. The final depth of the tree is calculated from the number of continuous and categorical features, but it won't exceed this number. Note: memory usage grows exponentially with depth.
random_state (float, optional) – Sets the numpy random seed.
feature_domains (dict, optional) – A dictionary of domain values for all features, where the keys are the feature indexes in the training data and the values are an array of domain values for categorical features and an array of min and max values for continuous features. For example, if the training data is [[2, 'dog'], [5, 'cat'], [7, 'dog']], then the feature_domains would be {'0': [2, 7], '1': ['dog', 'cat']}. If not provided, feature domains will be constructed from the data, but this will result in a PrivacyLeakWarning.

n_features_in_¶

The number of features when fit is performed.

Type: int

n_classes_¶

The number of classes.

Type: int

classes_¶

The class labels.

Type: array of shape (n_classes, )

cat_features_¶

Categorical feature indexes.

Type: array of categorical feature indexes

max_depth_¶

Final max depth used for constructing decision trees.

Type: int

estimators_¶

The collection of fitted sub-estimators.

Type: list of DecisionTreeClassifier

feature_domains_¶

Dictionary of domain values mapped to feature indexes in the training data.

Type: dict

Examples

>>> from sklearn.datasets import make_classification
>>> from diffprivlib.models import RandomForestClassifier
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]

References

[1] Sam Fletcher, Md Zahidul Islam. “Differentially Private Random Decision Forests using Smooth Sensitivity” https://arxiv.org/abs/1606.03572

fit(X, y, sample_weight=None)[source]¶

Fit the model to the given training data.

Parameters

X (array-like, shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape (n_samples,)) – Target vector relative to X.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.

Returns

self

Return type

class

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

property n_features_¶

Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.

Number of features when fitting the estimator.

Type: DEPRECATED

predict(X)¶

Predict class for X.

The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns: y – The predicted classes.
Return type: ndarray of shape (n_samples,) or (n_samples, n_outputs)

predict_log_proba(X)¶

Predict class log-probabilities for X.

The predicted class log-probabilities of an input sample is computed as the log of the mean predicted class probabilities of the trees in the forest.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns: p – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type: ndarray of shape (n_samples, n_classes), or a list of such arrays

predict_proba(X)¶

Predict class probabilities for X.

The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns: p – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type: ndarray of shape (n_samples, n_classes), or a list of such arrays

score(X, y, sample_weight=None)¶

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

Regression models¶

Linear Regression¶

class diffprivlib.models.LinearRegression(*, epsilon=1.0, bounds_X=None, bounds_y=None, fit_intercept=True, copy_X=True, accountant=None, **unused_args)[source]¶

Ordinary least squares Linear Regression with differential privacy.

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Differential privacy is guaranteed with respect to the training sample.

Differential privacy is achieved by adding noise to the coefficients of the objective function, taking inspiration from [ZZX12].

Parameters

epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds_X (tuple) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
bounds_y (tuple) – Same as bounds_X, but for the training label set y.
fit_intercept (bool, default: True) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
copy_X (bool, default: True) – If True, X will be copied; else, it may be overwritten.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

coef_¶

Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

Type: array of shape (n_features, ) or (n_targets, n_features)

intercept_¶

Independent term in the linear model. Set to 0.0 if fit_intercept = False.

Type: float or array of shape of (n_targets,)

References

ZZX12: Zhang, Jun, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, and Marianne Winslett. “Functional mechanism: regression analysis under differential privacy.” arXiv preprint arXiv:1208.0219 (2012).

fit(X, y, sample_weight=None)[source]¶

Fit linear model.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples, n_targets)) – Target values. Will be cast to X’s dtype if necessary.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.

Returns

self – Returns an instance of self.

Return type

object

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict(X)¶

Predict using the linear model.

Parameters: X (array-like or sparse matrix, shape (n_samples, n_features)) – Samples.
Returns: C – Returns predicted values.
Return type: array, shape (n_samples,)

score(X, y, sample_weight=None)¶

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
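
The definition above can be checked with a few lines of NumPy. The function name r2 and the sample values are illustrative; the arithmetic follows the formula exactly.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - u / v, as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return 1.0 - u / v

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2(y_true, y_true)                                  # exact predictions
constant = r2(y_true, np.full_like(y_true, y_true.mean()))    # always predict the mean
worse = r2(y_true, y_true[::-1])                              # reversed predictions
```

The three cases give 1.0, 0.0 and -3.0 respectively, showing the best score, the constant-model baseline, and a negative score.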

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – \(R^2\) of self.predict(X) wrt. y.

Return type

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

Clustering models¶

K-Means¶

class diffprivlib.models.KMeans(n_clusters=8, *, epsilon=1.0, bounds=None, accountant=None, **unused_args)[source]¶

K-Means clustering with differential privacy.

Implements the DPLloyd approach presented in [SCL16], leveraging the sklearn.cluster.KMeans class for full integration with Scikit Learn.

Parameters

n_clusters (int, default: 8) – The number of clusters to form as well as the number of centroids to generate.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

cluster_centers_¶

Coordinates of cluster centers. If the algorithm stops before fully converging, these will not be consistent with labels_.

Type: array, [n_clusters, n_features]

labels_¶: Labels of each point

inertia_¶

Sum of squared distances of samples to their closest cluster center.

Type: float

n_iter_¶

Number of iterations run.

Type: int

References

SCL16: Su, Dong, Jianneng Cao, Ninghui Li, Elisa Bertino, and Hongxia Jin. “Differentially private k-means clustering.” In Proceedings of the sixth ACM conference on data and application security and privacy, pp. 26-37. ACM, 2016.

fit(X, y=None, sample_weight=None)[source]¶

Computes k-means clustering with differential privacy.

Parameters

X (array-like, shape=(n_samples, n_features)) – Training instances to cluster.
y (Ignored) – not used, present here for API consistency by convention.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.

Returns

self

Return type

class

fit_predict(X, y=None, sample_weight=None)[source]¶

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labels – Index of the cluster each sample belongs to.

Return type

ndarray of shape (n_samples,)

fit_transform(X, y=None, sample_weight=None)[source]¶

Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

X_new – X transformed in the new space.

Return type

ndarray of shape (n_samples, n_clusters)

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict(X, sample_weight=None)[source]¶

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to predict.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labels – Index of the cluster each sample belongs to.

Return type

ndarray of shape (n_samples,)

score(X, y=None, sample_weight=None)[source]¶

Opposite of the value of X on the K-means objective.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

score – Opposite of the value of X on the K-means objective.

Return type

float

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

transform(X)[source]¶

Transform X to a cluster-distance space.

In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
Returns: X_new – X transformed in the new space.
Return type: ndarray of shape (n_samples, n_clusters)

Dimensionality reduction models¶

PCA¶

class diffprivlib.models.PCA(n_components=None, *, epsilon=1.0, data_norm=None, centered=False, bounds=None, copy=True, whiten=False, random_state=None, accountant=None, **unused_args)[source]¶

Principal component analysis (PCA) with differential privacy.

This class is a child of sklearn.decomposition.PCA, with amendments to allow for the implementation of differential privacy as given in [IS16b]. Some parameters of Scikit Learn’s model have therefore had to be fixed, including:

The only permitted svd_solver is ‘full’. Specifying the svd_solver option will result in a warning;

The parameters tol and iterated_power are not applicable (as a consequence of fixing svd_solver = 'full').

Parameters

n_components (int, float, None or str) –
Number of components to keep. If n_components is not set all components are kept:
```
n_components == min(n_samples, n_features)
```
If n_components == 'mle', Minka’s MLE is used to guess the dimension.

If 0 < n_components < 1, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

Hence, the None case results in:
```
n_components == min(n_samples, n_features) - 1
```
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\). If centered=False, half of epsilon is used to calculate the differentially private mean to center the data prior to the calculation of principal components.
data_norm (float, optional) –
The max l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy.

If not specified, the max norm is taken from the data when .fit() is first called, but will result in a PrivacyLeakWarning, as it reveals information about the data. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.
centered (bool, default: False) –
If False, the data will be centered before calculating the principal components. The mean used for centering is calculated with differential privacy, consuming privacy budget from epsilon.

If True, the data is assumed to have been centered previously (e.g. using StandardScaler), and therefore will not require the consumption of privacy budget to calculate the mean.
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
copy (bool, default: True) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
whiten (bool, default: False) –
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
random_state (int or RandomState instance, optional) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

components_¶

Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.

Type: array, shape (n_components, n_features)

explained_variance_¶

The amount of variance explained by each of the selected components.

Equal to n_components largest eigenvalues of the covariance matrix of X.

Type: array, shape (n_components,)

explained_variance_ratio_¶

Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.

Type: array, shape (n_components,)

singular_values_¶

The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.

Type: array, shape (n_components,)

mean_¶

Per-feature empirical mean, estimated from the training set.

Equal to X.mean(axis=0).

Type: array, shape (n_features,)

n_components_¶

The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.

Type: int

n_features_¶

Number of features in the training data.

Type: int

n_samples_¶

Number of samples in the training data.

Type: int

noise_variance_¶

The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.

Equal to the average of the (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.

Type: float
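
The averaging rule for noise_variance_ can be sketched in a few lines of NumPy. The example is non-private and uses assumed data and variable names; it shows only the arithmetic described above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
n_samples, n_features = X.shape
n_components = 2

# Eigenvalues of the sample covariance matrix, sorted largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Number of eigenvalues left over after keeping n_components of them.
n_discarded = min(n_features, n_samples) - n_components

# Average of the discarded (smallest) eigenvalues.
noise_variance = eigvals[n_components:n_components + n_discarded].mean()
```

Because the discarded eigenvalues are the smallest ones, `noise_variance` never exceeds the smallest retained eigenvalue.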

Preprocessing¶

Standard Scaler¶

class diffprivlib.models.StandardScaler(*, epsilon=1.0, bounds=None, copy=True, with_mean=True, with_std=True, accountant=None)[source]¶

Standardize features by removing the mean and scaling to unit variance, calculated with differential privacy guarantees. Differential privacy is guaranteed on the learned scaler with respect to the training sample; the transformed output will certainly not satisfy differential privacy.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the (differentially private) mean of the training samples or zero if with_mean=False, and s is the (differentially private) standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.
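
The standard score formula can be sketched in plain Python as follows. The helper `standard_score` is hypothetical, and the mean and standard deviation here are exact rather than differentially private; in the library these statistics carry noise.

```python
import math

def standard_score(column, mean=None, std=None):
    """Apply z = (x - u) / s to one feature column.

    mean=None plays the role of with_mean=False (u = 0);
    std=None plays the role of with_std=False (s = 1).
    """
    u = 0.0 if mean is None else mean
    s = 1.0 if std is None else std
    return [(x - u) / s for x in column]

# Statistics are computed once on the training column ...
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u = sum(train) / len(train)
s = math.sqrt(sum((x - u) ** 2 for x in train) / len(train))

# ... and then reused, unchanged, on later data.
z_train = standard_score(train, mean=u, std=s)
z_new = standard_score([5.0, 9.0], mean=u, std=s)
```

For this training column u is 5.0 and s is 2.0, so the new values 5.0 and 9.0 map to 0.0 and 2.0.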

For further information, users are referred to sklearn.preprocessing.StandardScaler.

Parameters

epsilon (float, default: 1.0) – The privacy budget to be allocated to learning the mean and variance of the training sample. If with_std=True, the privacy budget is split evenly between mean and variance (the mean must be calculated even when with_mean=False, as it is used in the calculation of the variance).
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
copy (boolean, default: True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array, a copy may still be returned.
with_mean (boolean, default: True) – If True, center the data before scaling.
with_std (boolean, default: True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
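
To make the roles of epsilon and bounds concrete, the sketch below applies a textbook Laplace mechanism to the mean and variance of a single bounded feature, splitting the budget evenly between the two. It is an assumption-laden illustration, not diffprivlib's implementation: the helper names are hypothetical, and the sensitivities used (width / n for the mean, width² / n as a conservative bound for the variance) are choices made for this example.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean_and_var(column, epsilon, bounds, rng):
    """Hypothetical helper: noisy mean and variance of one bounded feature."""
    lower, upper = bounds
    n = len(column)
    # Clip to the declared bounds so the sensitivities below hold.
    clipped = [min(max(x, lower), upper) for x in column]
    eps_each = epsilon / 2.0           # even split between mean and variance

    mean = sum(clipped) / n
    var = sum((x - mean) ** 2 for x in clipped) / n

    width = upper - lower
    noisy_mean = mean + laplace_noise(width / (n * eps_each), rng)
    noisy_var = var + laplace_noise(width ** 2 / (n * eps_each), rng)

    # Clamp to valid ranges; post-processing consumes no extra budget.
    noisy_mean = min(max(noisy_mean, lower), upper)
    noisy_var = min(max(noisy_var, 0.0), width ** 2 / 4.0)
    return noisy_mean, noisy_var

rng = random.Random(0)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
noisy_mean, noisy_var = private_mean_and_var(data, 1.0, (0.0, 10.0), rng)
```

Supplying the bounds up front is what lets the noise scale be fixed without inspecting the data. With per-feature bounds, the same helper would be called once per column with that column's own (min, max) pair.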

scale_¶

Per feature relative scaling of the data. This is calculated using np.sqrt(var_). Equal to None when with_std=False.

Type: ndarray or None, shape (n_features,)

mean_¶

The mean value for each feature in the training set. Equal to None when with_mean=False.

Type: ndarray or None, shape (n_features,)

var_¶

The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.

Type: ndarray or None, shape (n_features,)

n_samples_seen_¶

The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer; otherwise it will be an array. It is reset on new calls to fit, but increments across partial_fit calls.

Type: int or array, shape (n_features,)
