diffprivlib.models
Machine learning models with differential privacy
Classification models
Gaussian Naive Bayes

class
diffprivlib.models.
GaussianNB
(epsilon=1.0, bounds=None, priors=None, var_smoothing=1e-09, accountant=None) Gaussian Naive Bayes (GaussianNB) with differential privacy.
Inherits the sklearn.naive_bayes.GaussianNB class from Scikit Learn and adds noise to the learned means and variances to satisfy differential privacy. Adapted from the work presented in [VSB13].
 Parameters
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) for the model.
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
priors (array-like, shape (n_classes,)) – Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
var_smoothing (float, default: 1e-9) – Portion of the largest variance of all features that is added to variances for calculation stability.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
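To illustrate why data-independent bounds matter, here is a minimal sketch of a Laplace-noised mean over clipped values. It is not diffprivlib's implementation: the helper name private_mean and the sensitivity formula (range divided by sample count) are assumptions chosen for illustration only.

```python
import random


def private_mean(values, lower, upper, epsilon, rng=None):
    """Illustrative epsilon-DP mean of values clipped to [lower, upper]."""
    rng = rng or random.Random()
    # Clip every value into the declared bounds so the sensitivity is known
    # without inspecting the data (assumes upper > lower).
    clipped = [min(max(v, lower), upper) for v in values]
    # Assumed sensitivity: replacing one record moves the mean by at most this.
    sensitivity = (upper - lower) / len(clipped)
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials is Laplace(0, scale) noise.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clipped) / len(clipped) + noise
```

A smaller epsilon or a wider (min, max) range increases the noise scale, which is why tight bounds chosen from domain knowledge tend to give more useful models.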

class_prior_
Probability of each class.
 Type
array, shape (n_classes,)

class_count_
Number of training samples observed in each class.
 Type
array, shape (n_classes,)

theta_
Mean of each feature per class.
 Type
array, shape (n_classes, n_features)

sigma_
Variance of each feature per class.
 Type
array, shape (n_classes, n_features)

epsilon_
Absolute additive value to variances (unrelated to the epsilon parameter for differential privacy).
References
 VSB13
Vaidya, Jaideep, Basit Shafiq, Anirban Basu, and Yuan Hong. “Differentially private naive bayes classification.” In 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1, pp. 571-576. IEEE, 2013.

fit
(X, y, sample_weight=None) Fit Gaussian Naive Bayes according to X, y.
 Parameters
X (array-like, shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape (n_samples,)) – Target values.
sample_weight (array-like, shape (n_samples,), optional (default=None)) – Weights applied to individual samples (1. for unweighted).
New in version 0.17: Gaussian Naive Bayes supports fitting with sample_weight.
 Returns
self
 Return type

get_params
(deep=True) Get parameters for this estimator.
 Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
params – Parameter names mapped to their values.
 Return type
mapping of string to any

partial_fit
(X, y, classes=None, sample_weight=None) Incremental fit on a batch of samples.
This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning.
This is especially useful when the whole dataset is too big to fit in memory at once.
This method has some performance and numerical stability overhead, so it is better to call partial_fit on chunks of data that are as large as possible (as long as they fit in the memory budget) to hide the overhead.
 Parameters
X (array-like, shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape (n_samples,)) – Target values.
classes (array-like, shape (n_classes,), optional (default=None)) – List of all the classes that can possibly appear in the y vector.
Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
sample_weight (array-like, shape (n_samples,), optional (default=None)) – Weights applied to individual samples (1. for unweighted).
New in version 0.17.
 Returns
self
 Return type

predict
(X) Perform classification on an array of test vectors X.
 Parameters
X (array-like of shape (n_samples, n_features)) –
 Returns
C – Predicted target values for X
 Return type
ndarray of shape (n_samples,)

predict_log_proba
(X) Return log-probability estimates for the test vector X.
 Parameters
X (array-like of shape (n_samples, n_features)) –
 Returns
C – Returns the log-probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
 Return type
array-like of shape (n_samples, n_classes)

predict_proba
(X) Return probability estimates for the test vector X.
 Parameters
X (array-like of shape (n_samples, n_features)) –
 Returns
C – Returns the probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
 Return type
array-like of shape (n_samples, n_classes)

score
(X, y, sample_weight=None) Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.
 Parameters
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
 Returns
score – Mean accuracy of self.predict(X) wrt. y.
 Return type

set_params
(**params) Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
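The <component>__<parameter> naming convention can be sketched as a simple string split on the first double underscore. The helper split_param below is hypothetical and only illustrates how such names decompose; it is not part of diffprivlib or scikit-learn.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter)."""
    component, delim, sub_param = name.partition("__")
    if not delim:
        # No double underscore: the name targets the estimator itself.
        return None, name
    # Only the first delimiter is consumed, so deeper nesting is preserved
    # in the remainder and can be split again by the nested object.
    return component, sub_param
```

For example, a pipeline step named clf holding one of these models would expose its privacy parameter as clf__epsilon.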
Logistic Regression

class
diffprivlib.models.
LogisticRegression
(epsilon=1.0, data_norm=None, tol=0.0001, C=1.0, fit_intercept=True, max_iter=100, verbose=0, warm_start=False, n_jobs=None, accountant=None, **unused_args) Logistic Regression (aka logit, MaxEnt) classifier with differential privacy.
This class implements regularised logistic regression using Scipy’s L-BFGS-B algorithm. \(\epsilon\)-Differential privacy is achieved relative to the maximum norm of the data, as determined by data_norm, by the Vector mechanism, which adds a Laplace-distributed random vector to the objective. Adapted from the work presented in [CMS11].
This class is a child of sklearn.linear_model.LogisticRegression, with amendments to allow for the implementation of differential privacy. Some parameters of Scikit Learn’s model have therefore had to be fixed, including:
The only permitted solver is ‘lbfgs’. Specifying the solver option will result in a warning.
Consequently, the only permitted penalty is ‘l2’. Specifying the penalty option will result in a warning.
In the multiclass case, only the one-vs-rest (OvR) scheme is permitted. Specifying the multi_class option will result in a warning.
 Parameters
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
data_norm (float, optional) –
The max l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy.
If not specified, the max norm is taken from the data when .fit() is first called, but will result in a PrivacyLeakWarning, as it reveals information about the data. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.
tol (float, default: 1e-4) – Tolerance for stopping criteria.
C (float, default: 1.0) – Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
fit_intercept (bool, default: True) – Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
max_iter (int, default: 100) – Maximum number of iterations taken for the solver to converge. For smaller epsilon (more noise), max_iter may need to be increased.
verbose (int, default: 0) – Set to any positive number for verbosity.
warm_start (bool, default: False) – When set to True, reuse the solution of the previous call to fit as initialization; otherwise, just erase the previous solution.
n_jobs (int, optional) – Number of CPU cores used when parallelising over classes. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
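As a rough illustration of what data_norm refers to, the sketch below computes the largest row l2 norm of a dataset and rescales rows to fit within a norm chosen in advance. The helpers max_row_norm and clip_rows are hypothetical; whether the library itself clips or only warns is not stated here, so treat the clipping as an assumed preprocessing step.

```python
import math


def max_row_norm(rows):
    """Largest l2 norm over all rows (the quantity data_norm bounds)."""
    return max(math.sqrt(sum(x * x for x in row)) for row in rows)


def clip_rows(rows, data_norm):
    """Rescale any row whose l2 norm exceeds data_norm; leave others as-is."""
    clipped = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        factor = min(1.0, data_norm / norm) if norm > 0 else 1.0
        clipped.append([x * factor for x in row])
    return clipped
```

Computing max_row_norm on the training data and passing the result as data_norm would itself leak information, which is the situation the PrivacyLeakWarning flags; the norm should come from domain knowledge instead.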

classes_
A list of class labels known to the classifier.
 Type
array, shape (n_classes, )

coef_
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
 Type
array, shape (1, n_features) or (n_classes, n_features)

intercept_
Intercept (a.k.a. bias) added to the decision function.
If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary.
 Type
array, shape (1,) or (n_classes,)

n_iter_
Actual number of iterations for all classes. If binary, it returns only 1 element.
 Type
array, shape (n_classes,) or (1, )
Examples
>>> from sklearn.datasets import load_iris
>>> from diffprivlib.models import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(data_norm=12, epsilon=2).fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[7.35362932e-01, 2.16667422e-14, 2.64637068e-01],
       [9.08384378e-01, 3.47767052e-13, 9.16156215e-02]])
>>> clf.score(X, y)
0.5266666666666666
See also
sklearn.linear_model.LogisticRegression
The implementation of logistic regression in scikit-learn, upon which this implementation is built.
Vector
The mechanism used by the model to achieve differential privacy.
References
 CMS11
Chaudhuri, Kamalika, Claire Monteleoni, and Anand D. Sarwate. “Differentially private empirical risk minimization.” Journal of Machine Learning Research 12, no. Mar (2011): 1069-1109.

decision_function
(X) Predict confidence scores for samples.
The confidence score for a sample is the signed distance of that sample to the hyperplane.
 Parameters
X (array_like or sparse matrix, shape (n_samples, n_features)) – Samples.
 Returns
Confidence scores per (sample, class) combination. In the binary case, confidence score for self.classes_[1] where >0 means this class would be predicted.
 Return type
array, shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes)

densify
() Convert coefficient matrix to dense array format.
Converts the coef_ member (back) to a numpy.ndarray. This is the default format of coef_ and is required for fitting, so calling this method is only required on models that have previously been sparsified; otherwise, it is a no-op.
 Returns
Fitted estimator.
 Return type
self

fit
(X, y, sample_weight=None) Fit the model according to the given training data.
 Parameters
X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape (n_samples,)) – Target vector relative to X.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.
 Returns
self
 Return type
class

get_params
(deep=True) Get parameters for this estimator.
 Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
params – Parameter names mapped to their values.
 Return type
mapping of string to any

predict
(X) Predict class labels for samples in X.
 Parameters
X (array_like or sparse matrix, shape (n_samples, n_features)) – Samples.
 Returns
C – Predicted class label per sample.
 Return type
array, shape [n_samples]

predict_log_proba
(X) Predict logarithm of probability estimates.
The returned estimates for all classes are ordered by the label of classes.
 Parameters
X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
 Returns
T – Returns the log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
 Return type
array-like of shape (n_samples, n_classes)

predict_proba
(X) Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
For a multi_class problem, if multi_class is set to “multinomial”, the softmax function is used to find the predicted probability of each class. Otherwise, a one-vs-rest approach is used, i.e. the probability of each class is calculated assuming it to be positive using the logistic function, and these values are normalized across all the classes.
 Parameters
X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
 Returns
T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
 Return type
array-like of shape (n_samples, n_classes)
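The two normalisation schemes described for predict_proba can be sketched on a single row of decision scores. This is an illustrative sketch only: the function names are hypothetical and the code works on plain Python lists rather than the arrays the estimator uses.

```python
import math


def ovr_probabilities(scores):
    """One-vs-rest: logistic function per class, then normalise across classes."""
    positives = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(positives)
    return [p / total for p in positives]


def softmax_probabilities(scores):
    """Multinomial: softmax over the class scores."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Both functions return values that sum to 1 and preserve the ranking of the scores, but the one-vs-rest version is typically less peaked than softmax for the same scores.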

score
(X, y, sample_weight=None) Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.
 Parameters
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
 Returns
score – Mean accuracy of self.predict(X) wrt. y.
 Return type

set_params
(**params) Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

sparsify
() Convert coefficient matrix to sparse format.
Converts the coef_ member to a scipy.sparse matrix, which for L1-regularized models can be much more memory- and storage-efficient than the usual numpy.ndarray representation.
The intercept_ member is not converted.
 Returns
Fitted estimator.
 Return type
self
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call densify.
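The 50% rule of thumb can be checked before calling sparsify. The sketch below is a pure-Python stand-in for (coef_ == 0).sum() divided by the number of coefficients; zero_fraction is a hypothetical helper and coef is assumed to be a list of rows.

```python
def zero_fraction(coef):
    """Fraction of coefficients that are exactly zero."""
    flat = [value for row in coef for value in row]
    return sum(1 for value in flat if value == 0) / len(flat)
```

A result above 0.5 suggests the sparse representation may pay off; below that, the dense array is likely the better choice.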
Regression models
Linear Regression

class
diffprivlib.models.
LinearRegression
(epsilon=1.0, data_norm=None, bounds_X=None, bounds_y=None, fit_intercept=True, copy_X=True, accountant=None, **unused_args) Ordinary least squares Linear Regression with differential privacy.
LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Differential privacy is guaranteed with respect to the training sample.
Differential privacy is achieved by adding noise to the second moment matrix using the Wishart mechanism. This method is demonstrated in [She15], but our implementation takes inspiration from the use of the Wishart distribution in [IS16] to achieve a strict differential privacy guarantee.
 Parameters
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
data_norm (float, optional) –
The max l2 norm of any row of the concatenated dataset A = [X; y]. This defines the spread of data that will be protected by differential privacy.
If not specified, the max norm is taken from the data when .fit() is first called, but will result in a PrivacyLeakWarning, as it reveals information about the data. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.
bounds_X (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
bounds_y (tuple) – Same as bounds_X, but for the training label set y.
fit_intercept (bool, default: True) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
copy_X (bool, default: True) – If True, X will be copied; else, it may be overwritten.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

coef_
Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.
 Type
array of shape (n_features, ) or (n_targets, n_features)

singular_
Singular values of X.
 Type
array of shape (min(X, y),)

intercept_
Independent term in the linear model. Set to 0.0 if fit_intercept = False.
 Type
float or array of shape of (n_targets,)
References
 She15
Sheffet, Or. “Private approximations of the 2nd-moment matrix using existing techniques in linear regression.” arXiv preprint arXiv:1507.00056 (2015).
 IS16
Imtiaz, Hafiz, and Anand D. Sarwate. “Symmetric matrix perturbation for differentially-private principal component analysis.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2339-2343. IEEE, 2016.

fit
(X, y, sample_weight=None) Fit linear model.
 Parameters
X (array-like or sparse matrix, shape (n_samples, n_features)) – Training data.
y (array_like, shape (n_samples, n_targets)) – Target values. Will be cast to X’s dtype if necessary.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.
 Returns
self
 Return type
returns an instance of self.

get_params
(deep=True) Get parameters for this estimator.
 Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
params – Parameter names mapped to their values.
 Return type
mapping of string to any

predict
(X) Predict using the linear model.
 Parameters
X (array_like or sparse matrix, shape (n_samples, n_features)) – Samples.
 Returns
C – Returns predicted values.
 Return type
array, shape (n_samples,)

score
(X, y, sample_weight=None) Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
 Parameters
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
 Returns
score – R^2 of self.predict(X) wrt. y.
 Return type
Notes
The R^2 score used when calling score on a regressor will use multioutput='uniform_average' from version 0.23 to keep consistent with r2_score(). This will influence the score method of all the multioutput regressors (except for MultiOutputRegressor). To specify the default value manually and avoid the warning, please either call r2_score() directly or make a custom scorer with make_scorer() (the built-in scorer 'r2' uses multioutput='uniform_average').
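The R^2 definition used by score, (1 - u/v), can be written out directly for a single target. The function below is a minimal sketch on plain lists with a hypothetical name; it ignores sample weights and multi-output averaging, and assumes the targets are not all identical (so that v is non-zero).

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination, 1 - u/v, for a single target."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares between targets and predictions.
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares of the targets around their mean.
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - u / v
```

With differential privacy the fitted coefficients are noisy, so a negative R^2 on small datasets or small epsilon is not necessarily a bug.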

set_params
(**params) Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Clustering models
K-Means

class
diffprivlib.models.
KMeans
(epsilon=1.0, bounds=None, n_clusters=8, accountant=None, **unused_args) K-Means clustering with differential privacy.
Implements the DPLloyd approach presented in [SCL16], leveraging the sklearn.cluster.KMeans class for full integration with Scikit Learn.
 Parameters
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
n_clusters (int, default: 8) – The number of clusters to form as well as the number of centroids to generate.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

cluster_centers_
Coordinates of cluster centers. If the algorithm stops before fully converging, these will not be consistent with labels_.
 Type
array, [n_clusters, n_features]

labels_
Labels of each point.
References
 SCL16
Su, Dong, Jianneng Cao, Ninghui Li, Elisa Bertino, and Hongxia Jin. “Differentially private k-means clustering.” In Proceedings of the sixth ACM conference on data and application security and privacy, pp. 26-37. ACM, 2016.

fit
(X, y=None, sample_weight=None) Computes k-means clustering with differential privacy.
 Parameters
X (array-like, shape=(n_samples, n_features)) – Training instances to cluster.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.
 Returns
self
 Return type
class

fit_predict
(X, y=None, sample_weight=None) Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
 Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
 Returns
labels – Index of the cluster each sample belongs to.
 Return type
array, shape [n_samples,]

fit_transform
(X, y=None, sample_weight=None) Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
 Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
 Returns
X_new – X transformed in the new space.
 Return type
array, shape [n_samples, k]

get_params
(deep=True) Get parameters for this estimator.
 Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
params – Parameter names mapped to their values.
 Return type
mapping of string to any

predict
(X, sample_weight=None) Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
 Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to predict.
sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
 Returns
labels – Index of the cluster each sample belongs to.
 Return type
array, shape [n_samples,]

score
(X, y=None, sample_weight=None) Opposite of the value of X on the K-means objective.
 Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
 Returns
score – Opposite of the value of X on the K-means objective.
 Return type

set_params
(**params) Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

transform
(X) Transform X to a cluster-distance space.
In the new space, each dimension is the distance to one of the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
 Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
 Returns
X_new – X transformed in the new space.
 Return type
array, shape [n_samples, k]
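The cluster-distance space produced by transform can be sketched as a table of Euclidean distances from each sample to each centre. The helpers below are hypothetical and operate on plain lists; they are meant to show the shape of the output, [n_samples, k], not the library's implementation.

```python
import math


def to_cluster_distances(X, centers):
    """Return one row per sample holding its distance to every centre."""
    return [[math.dist(x, c) for c in centers] for x in X]


def closest_cluster(distances):
    """Index of the nearest centre for each row of distances."""
    return [row.index(min(row)) for row in distances]
```

Taking the index of the smallest entry in each row recovers the label that predict would assign under a nearest-centre rule.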
Dimensionality reduction models
PCA

class
diffprivlib.models.
PCA
(n_components=None, centered=False, epsilon=1.0, data_norm=None, bounds=None, copy=True, whiten=False, random_state=None, accountant=None, **unused_args) Principal component analysis (PCA) with differential privacy.
This class is a child of sklearn.decomposition.PCA, with amendments to allow for the implementation of differential privacy as given in [IS16b]. Some parameters of Scikit Learn’s model have therefore had to be fixed, including:
The only permitted svd_solver is ‘full’. Specifying the svd_solver option will result in a warning;
The parameters tol and iterated_power are not applicable (as a consequence of fixing svd_solver = 'full').
 Parameters
n_components (int, float, None or str) –
Number of components to keep. If n_components is not set all components are kept:
n_components == min(n_samples, n_features)
If n_components == 'mle', Minka’s MLE is used to guess the dimension.
If 0 < n_components < 1, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
Hence, the None case results in:
n_components == min(n_samples, n_features) - 1
centered (bool, default: False) –
If False, the data will be centered before calculating the principal components. The mean will be calculated with differential privacy, consuming privacy budget from epsilon.
If True, the data is assumed to have been centered previously (e.g. using StandardScaler), and therefore will not require the consumption of privacy budget to calculate the mean.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\). If centered=False, half of epsilon is used to calculate the differentially private mean to center the data prior to the calculation of principal components.
data_norm (float, optional) –
The max l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy.
If not specified, the max norm is taken from the data when .fit() is first called, but will result in a PrivacyLeakWarning, as it reveals information about the data. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
copy (bool, default: True) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results; use fit_transform(X) instead.
whiten (bool, default: False) –
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
random_state (int or RandomState instance, optional) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
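The interaction between centered and epsilon can be sketched as a simple budget split. The text above states that half of epsilon goes to the private mean when centered=False; that the remaining half is spent on the principal components is an assumption of this sketch, and split_epsilon is a hypothetical helper rather than a library function.

```python
def split_epsilon(epsilon, centered):
    """Return (epsilon_for_mean, epsilon_for_components)."""
    if centered:
        # Data already centered: no budget is needed for the mean.
        return 0.0, epsilon
    # Otherwise half the budget is spent on the private mean (per the text);
    # the other half going to the components is assumed here.
    return epsilon / 2, epsilon / 2
```

Centering the data beforehand with a data-independent shift therefore leaves the whole budget for the decomposition itself.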

components_
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
 Type
array, shape (n_components, n_features)

explained_variance_
The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
 Type
array, shape (n_components,)

explained_variance_ratio_
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
 Type
array, shape (n_components,)

singular_values_
The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
 Type
array, shape (n_components,)

mean_
Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
 Type
array, shape (n_features,)

n_components_
The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.
 Type

noise_variance_
The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/metmppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
 Type
See also
sklearn.decomposition.PCA
Scikit-learn implementation of Principal Component Analysis.
References
 IS16b
Imtiaz, Hafiz, and Anand D. Sarwate. “Symmetric matrix perturbation for differentially-private principal component analysis.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2339-2343. IEEE, 2016.

fit_transform
(X, y=None) Fit the model with X and apply the dimensionality reduction on X.
 Parameters
X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (None) – Ignored variable.
 Returns
X_new – Transformed values.
 Return type
array-like, shape (n_samples, n_components)
Notes
This method returns a Fortran-ordered array. To convert it to a C-ordered array, use ‘np.ascontiguousarray’.

get_covariance
() Compute data covariance with the generative model.
cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)
where S**2 contains the explained variances, and sigma2 contains the noise variances.
 Returns
cov – Estimated covariance of data.
 Return type
array, shape=(n_features, n_features)
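The formula above can be sketched in plain NumPy. This is a minimal illustration of the expression exactly as written in this section, not the library's implementation (which may differ in detail); the function name and the toy inputs are assumptions made for the example.

```python
import numpy as np

def covariance_from_components(components, explained_variance, noise_variance):
    """Evaluate components_.T * S**2 * components_ + sigma2 * eye(n_features).

    components: array of shape (n_components, n_features), one principal axis per row.
    explained_variance: array of shape (n_components,), the S**2 term.
    noise_variance: scalar, the sigma2 term.
    """
    n_features = components.shape[1]
    # Broadcasting scales each column of components.T by its explained variance.
    low_rank = (components.T * explained_variance) @ components
    return low_rank + noise_variance * np.eye(n_features)

# Toy inputs (hypothetical): two axis-aligned components in a 3-feature space.
components = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
explained_variance = np.array([4.0, 2.0])
cov = covariance_from_components(components, explained_variance, noise_variance=0.5)
```

With axis-aligned components the result is diagonal, which makes the contribution of each term easy to read off.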

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
params – Parameter names mapped to their values.
 Return type
mapping of string to any

get_precision
()¶ Compute data precision matrix with the generative model.
Equals the inverse of the covariance but computed with the matrix inversion lemma for efficiency.
 Returns
precision – Estimated precision of data.
 Return type
array, shape=(n_features, n_features)
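As a rough sketch of how the matrix inversion lemma helps here: if the covariance has the form W.T @ diag(S2) @ W + sigma2 * I, its inverse can be obtained by solving an (n_components x n_components) system instead of inverting the full (n_features x n_features) matrix. The code below is an illustration under that assumption, with hypothetical names; it is not taken from the library and requires strictly positive explained variances and noise variance.

```python
import numpy as np

def precision_via_inversion_lemma(components, explained_variance, noise_variance):
    """Invert W.T @ diag(S2) @ W + sigma2 * I using the matrix inversion lemma."""
    W = components                      # shape (n_components, n_features)
    n_features = W.shape[1]
    # Small (n_components x n_components) system that replaces the full inverse.
    inner = np.diag(1.0 / explained_variance) + (W @ W.T) / noise_variance
    correction = W.T @ np.linalg.solve(inner, W) / noise_variance**2
    return np.eye(n_features) / noise_variance - correction

# Hypothetical inputs: two orthonormal directions in a 5-feature space.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))
components = Q.T
explained_variance = np.array([3.0, 1.5])
noise_variance = 0.2

precision = precision_via_inversion_lemma(components, explained_variance, noise_variance)
cov = (components.T * explained_variance) @ components + noise_variance * np.eye(5)
```

The saving matters when n_components is much smaller than n_features, since only the small inner system is solved.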

inverse_transform
(X)¶ Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
 Parameters
X (array-like, shape (n_samples, n_components)) – New data, where n_samples is the number of samples and n_components is the number of components.
 Returns
X_original
 Return type
array-like, shape (n_samples, n_features)
Notes
If whitening is enabled, inverse_transform will compute the exact inverse operation, which includes reversing whitening.
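The round trip described in the note can be illustrated with a small non-private sketch. It uses an ordinary eigendecomposition as a stand-in for the fitted model, so the helper functions and variable names are assumptions for the example rather than the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Non-private stand-in for a fitted model: mean, principal axes, variances.
mean = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
order = np.argsort(eigvals)[::-1]           # largest variance first
components = eigvecs[:, order].T            # shape (n_components, n_features)
explained_variance = eigvals[order]

def transform(X, whiten=False):
    X_new = (X - mean) @ components.T
    # Whitening rescales every projected coordinate to unit variance.
    return X_new / np.sqrt(explained_variance) if whiten else X_new

def inverse_transform(X_new, whiten=False):
    if whiten:
        # Undo the whitening before projecting back to feature space.
        X_new = X_new * np.sqrt(explained_variance)
    return X_new @ components + mean

X_back = inverse_transform(transform(X, whiten=True), whiten=True)
```

Here all components are kept, so the reconstruction matches X up to floating-point error; with fewer components than features, the result is only an approximation of the original data.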

score
(X, y=None)[source]¶ Return the average log-likelihood of all samples.
See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf

score_samples
(X)[source]¶ Return the log-likelihood of each sample.
See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf
 Parameters
X (array, shape (n_samples, n_features)) – The data.
 Returns
ll – Log-likelihood of each sample under the current model.
 Return type
array, shape (n_samples,)
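For intuition, the per-sample log-likelihood under a Gaussian model with a given mean and covariance can be sketched as follows, with the average over samples playing the role of score. This is a generic illustration using a hypothetical helper name and toy inputs, not the library's code path.

```python
import numpy as np

def gaussian_log_likelihood(X, mean, cov):
    """Log-density of each row of X under N(mean, cov)."""
    Xc = X - mean
    precision = np.linalg.inv(cov)
    n_features = X.shape[1]
    _, logdet_cov = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every sample.
    mahalanobis = np.einsum("ij,jk,ik->i", Xc, precision, Xc)
    return -0.5 * (mahalanobis + n_features * np.log(2 * np.pi) + logdet_cov)

# Toy inputs (hypothetical): standard normal model in two dimensions.
X = np.array([[0.0, 0.0],
              [1.0, 0.0]])
ll = gaussian_log_likelihood(X, mean=np.zeros(2), cov=np.eye(2))
average_ll = ll.mean()   # analogous to what score returns
```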

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object.

transform
(X)¶ Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
 Parameters
X (array-like, shape (n_samples, n_features)) – New data, where n_samples is the number of samples and n_features is the number of features.
 Returns
X_new
 Return type
array-like, shape (n_samples, n_components)
Examples
>>> import numpy as np
>>> from sklearn.decomposition import IncrementalPCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> ipca = IncrementalPCA(n_components=2, batch_size=3)
>>> ipca.fit(X)
IncrementalPCA(batch_size=3, n_components=2)
>>> ipca.transform(X)
Preprocessing¶
Standard Scaler¶

class
diffprivlib.models.
StandardScaler
(epsilon=1.0, bounds=None, copy=True, with_mean=True, with_std=True, accountant=None)[source]¶ Standardize features by removing the mean and scaling to unit variance, calculated with differential privacy guarantees. Differential privacy is guaranteed on the learned scaler with respect to the training sample; the transformed output will certainly not satisfy differential privacy.
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the (differentially private) mean of the training samples or zero if with_mean=False, and s is the (differentially private) standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.
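To make the standard score concrete, here is a minimal single-feature sketch that estimates u and s with a textbook Laplace mechanism and then standardizes. It is an illustration only: the function name, the clipping step, the sensitivity bounds and the variance floor are assumptions for this example and should not be read as the mechanism diffprivlib itself uses.

```python
import numpy as np

def dp_standard_score(x, epsilon, bounds, rng):
    """Standardize one feature using noisy estimates of its mean and std.

    Illustrative Laplace-mechanism sketch; not diffprivlib's implementation.
    """
    lower, upper = bounds
    x = np.clip(x, lower, upper)        # enforce the declared bounds
    n = x.size
    width = upper - lower
    eps_each = epsilon / 2              # budget split evenly: mean and variance
    # Replacing one record moves the mean by at most width / n.
    u = x.mean() + rng.laplace(scale=(width / n) / eps_each)
    # Replacing one record moves the variance by at most width**2 / n.
    var = x.var() + rng.laplace(scale=(width**2 / n) / eps_each)
    s = np.sqrt(max(var, 1e-12))        # guard against a negative noisy variance
    return (x - u) / s, u, s

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
z, u, s = dp_standard_score(x, epsilon=1.0, bounds=(0.0, 10.0), rng=rng)
```

Only u and s carry the privacy guarantee in this sketch; z is computed from the raw values and is not itself differentially private, matching the caveat in the class description.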
For further information, users are referred to sklearn.preprocessing.StandardScaler.
 Parameters
epsilon (float, default: 1.0) – The privacy budget to be allocated to learning the mean and variance of the training sample. If with_std=True, the privacy budget is split evenly between mean and variance (the mean must be calculated even when with_mean=False, as it is used in the calculation of the variance).
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
copy (boolean, default: True) – If False, try to avoid a copy and do in-place scaling instead. This is not guaranteed to always work in-place; e.g. if the data is not a NumPy array, a copy may still be returned.
with_mean (boolean, True by default) – If True, center the data before scaling.
with_std (boolean, True by default) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

scale_
¶ Per-feature relative scaling of the data. This is calculated using np.sqrt(var_). Equal to None when with_std=False.
 Type
ndarray or None, shape (n_features,)

mean_
¶ The mean value for each feature in the training set. Equal to None when with_mean=False.
 Type
ndarray or None, shape (n_features,)

var_
¶ The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
 Type
ndarray or None, shape (n_features,)

n_samples_seen_
¶ The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer, otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.
 Type
int or array, shape (n_features,)
See also
sklearn.preprocessing.StandardScaler
Vanilla scikit-learn version, without differential privacy.
PCA
Further removes the linear correlation across features with ‘whiten=True’.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.

fit
(X, y=None)[source]¶ Compute the mean and std to be used for later scaling.
 Parameters
X ({array-like, sparse matrix}, shape [n_samples, n_features]) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y – Ignored

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
 Returns
X_new – Transformed array.
 Return type
numpy array of shape [n_samples, n_features_new]

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
params – Parameter names mapped to their values.
 Return type
mapping of string to any

inverse_transform
(X, copy=None)[source]¶ Scale back the data to the original representation.
 Parameters
X (array-like, shape [n_samples, n_features]) – The data used to scale along the features axis.
copy (bool, optional (default: None)) – Copy the input X or not.
 Returns
X_tr – Transformed array.
 Return type
array-like, shape [n_samples, n_features]

partial_fit
(X, y=None)[source]¶ Online computation of mean and std with differential privacy on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible because n_samples is very large or because X is read from a continuous stream.
The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.
 Parameters
X ({array-like}, shape [n_samples, n_features]) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y – Ignored
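The idea behind incremental mean and variance can be sketched with the standard pairwise combination of batch statistics, in the spirit of the Chan, Golub and LeVeque reference. The version below is non-private and uses hypothetical names; it only shows how running statistics are merged batch by batch, not how the library adds noise.

```python
import numpy as np

def combine(n_a, mean_a, m2_a, batch):
    """Merge running statistics (count, mean, sum of squared deviations) with a batch."""
    n_b = batch.shape[0]
    mean_b = batch.mean(axis=0)
    m2_b = ((batch - mean_b) ** 2).sum(axis=0)
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the shift between the two means.
    m2 = m2_a + m2_b + delta**2 * n_a * n_b / n
    return n, mean, m2

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))

# Start from empty statistics and feed the data in three batches.
n, mean, m2 = 0, np.zeros(2), np.zeros(2)
for batch in np.array_split(X, 3):
    n, mean, m2 = combine(n, mean, m2, batch)
var = m2 / n
```

After the last batch, mean and var agree with the statistics computed on the full array, which is the property partial_fit relies on when data arrives in pieces.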

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object.