diffprivlib.models
Machine learning models with differential privacy
Classification models
Gaussian Naive Bayes
- class diffprivlib.models.GaussianNB(*, epsilon=1.0, bounds=None, priors=None, var_smoothing=1e-09, random_state=None, accountant=None)
Gaussian Naive Bayes (GaussianNB) with differential privacy
Inherits the sklearn.naive_bayes.GaussianNB class from Scikit Learn and adds noise to the learned means and variances to satisfy differential privacy. Adapted from the work presented in [VSB13].
- Parameters:
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) for the model.
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
priors (array-like, shape (n_classes,)) – Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
var_smoothing (float, default: 1e-9) – Portion of the largest variance of all features that is added to variances for calculation stability.
random_state (int or RandomState, optional) – Controls the randomness of the model. To obtain deterministic behaviour during randomisation, random_state has to be fixed to an integer.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
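As a rough illustration of var_smoothing, the sketch below computes an additive term as var_smoothing times the largest per-feature population variance, which is how scikit-learn's GaussianNB defines epsilon_. The helper name smoothing_term and the sample data are hypothetical, and the differentially private model may obtain its variances differently.

```python
from statistics import pvariance


def smoothing_term(rows, var_smoothing=1e-9):
    """Return var_smoothing times the largest per-feature population variance."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    return var_smoothing * max(pvariance(col) for col in columns)


# Hypothetical data: three samples, two features.
rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
eps_ = smoothing_term(rows)
```

The term is tiny by default, so it only matters when a feature's variance is close to zero.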
- class_prior_
probability of each class.
- Type:
array, shape (n_classes,)
- class_count_
number of training samples observed in each class.
- Type:
array, shape (n_classes,)
- theta_
mean of each feature per class
- Type:
array, shape (n_classes, n_features)
- var_
variance of each feature per class
- Type:
array, shape (n_classes, n_features)
- epsilon_
absolute additive value to variances (unrelated to the epsilon parameter for differential privacy)
- Type:
References
[VSB13] Vaidya, Jaideep, Basit Shafiq, Anirban Basu, and Yuan Hong. “Differentially private naive bayes classification.” In 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1, pp. 571-576. IEEE, 2013.
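To make "adds noise to the learned means" concrete, here is a minimal sketch of a Laplace-noised mean over bounded data. It is not the library's implementation: the function names are hypothetical, the plain Laplace mechanism is an assumption (the library may use a truncated or bounded variant), and Python's random module is not suitable for production-grade privacy.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_mean(values, lower, upper, epsilon, rng):
    """Clip values to [lower, upper], average them, and add Laplace noise."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    noisy = sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)
    # Post-processing: keep the released value inside the declared bounds.
    return min(max(noisy, lower), upper)


rng = random.Random(0)
noisy_mean = private_mean([4.9, 5.1, 5.8, 6.3], lower=4.0, upper=8.0,
                          epsilon=1.0, rng=rng)
```

The sketch also shows why bounds should be chosen independently of the data: the noise scale depends on upper - lower, so bounds computed from the data would themselves leak information.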
- fit(X, y, sample_weight=None)
Fit Gaussian Naive Bayes according to X, y.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target values.
sample_weight (array-like of shape (n_samples,), default=None) –
Weights applied to individual samples (1. for unweighted).
New in version 0.17: Gaussian Naive Bayes supports fitting with sample_weight.
- Returns:
self – Returns the instance itself.
- Return type:
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- partial_fit(X, y, classes=None, sample_weight=None)
Incremental fit on a batch of samples.
This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning.
This is especially useful when the whole dataset is too big to fit in memory at once.
This method has some performance and numerical stability overhead, hence it is better to call partial_fit on chunks of data that are as large as possible (as long as fitting in the memory budget) to hide the overhead.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target values.
classes (array-like of shape (n_classes,), default=None) –
List of all the classes that can possibly appear in the y vector.
Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
sample_weight (array-like of shape (n_samples,), default=None) –
Weights applied to individual samples (1. for unweighted).
New in version 0.17.
- Returns:
self – Returns the instance itself.
- Return type:
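Incremental fitting relies on being able to combine per-chunk statistics. The sketch below shows one standard way to merge the count, mean and population variance of two disjoint chunks. It is a non-private illustration with hypothetical helper names, not the library's update rule.

```python
def chunk_stats(chunk):
    """Return (count, mean, population variance) of a list of numbers."""
    n = len(chunk)
    mean = sum(chunk) / n
    return n, mean, sum((x - mean) ** 2 for x in chunk) / n


def merge_stats(n_a, mean_a, var_a, n_b, mean_b, var_b):
    """Combine the statistics of two disjoint chunks into overall statistics."""
    n = n_a + n_b
    mean = (n_a * mean_a + n_b * mean_b) / n
    # Within-chunk squared deviations plus a between-chunk correction term.
    ssd = n_a * var_a + n_b * var_b + (n_a * n_b / n) * (mean_a - mean_b) ** 2
    return n, mean, ssd / n


data = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
stats = chunk_stats(data[:2])
for start in range(2, len(data), 2):
    stats = merge_stats(*stats, *chunk_stats(data[start:start + 2]))
```

After the loop, stats matches the statistics computed on the whole dataset at once, up to floating-point error.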
- predict(X)
Perform classification on an array of test vectors X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
- Returns:
C – Predicted target values for X.
- Return type:
ndarray of shape (n_samples,)
- predict_joint_log_proba(X)
Return joint log probability estimates for the test vector X.
For each row x of X and class y, the joint log probability is given by
log P(x, y) = log P(y) + log P(x|y),
where log P(y) is the log of the class prior probability and log P(x|y) is the log of the class-conditional probability.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
- Returns:
C – Returns the joint log-probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
- Return type:
ndarray of shape (n_samples, n_classes)
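The formula log P(x, y) = log P(y) + log P(x|y) can be sketched for a Gaussian naive Bayes model as follows. The function and the toy parameters are hypothetical; in a fitted model the priors, means and variances would come from class_prior_, theta_ and var_.

```python
import math


def joint_log_proba(x, priors, means, variances):
    """Return log P(x, y) for every class y under Gaussian naive Bayes."""
    result = []
    for prior, mu, var in zip(priors, means, variances):
        # Features are assumed independent given the class, so log-densities add.
        log_likelihood = sum(
            -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
            for xi, m, v in zip(x, mu, var)
        )
        result.append(math.log(prior) + log_likelihood)
    return result


priors = [0.5, 0.5]
means = [[0.0, 0.0], [2.0, 2.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
jll = joint_log_proba([0.0, 0.0], priors, means, variances)
```

The predicted class is the index of the largest entry, here class 0, whose mean coincides with the query point.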
- predict_log_proba(X)
Return log-probability estimates for the test vector X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
- Returns:
C – Returns the log-probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
- Return type:
array-like of shape (n_samples, n_classes)
- predict_proba(X)
Return probability estimates for the test vector X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
- Returns:
C – Returns the probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
- Return type:
array-like of shape (n_samples, n_classes)
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of self.predict(X) w.r.t. y.
- Return type:
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GaussianNB
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
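The <component>__<parameter> convention can be illustrated with a small sketch that groups keys by the part before the first double underscore. The helper name and the component name clf are hypothetical; this only mirrors the naming scheme and is not the estimator's own code.

```python
def split_nested_params(params):
    """Group 'component__parameter' keys by component; plain keys go under ''."""
    grouped = {}
    for key, value in params.items():
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            grouped.setdefault(component, {})[sub_key] = value
        else:
            grouped.setdefault("", {})[key] = value
    return grouped


grouped = split_nested_params(
    {"clf__epsilon": 0.5, "clf__bounds": (0, 1), "verbose": 1}
)
```

Here the two clf__ entries would be forwarded to the component named clf, while verbose applies to the outer object.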
- set_partial_fit_request(*, classes: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') GaussianNB
Request metadata passed to the partial_fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- Returns:
self – The updated object.
- Return type:
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GaussianNB
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- property sigma_
Variance of each feature per class.
Logistic Regression
- class diffprivlib.models.LogisticRegression(*, epsilon=1.0, data_norm=None, tol=0.0001, C=1.0, fit_intercept=True, max_iter=100, verbose=0, warm_start=False, n_jobs=None, random_state=None, accountant=None, **unused_args)
Logistic Regression (aka logit, MaxEnt) classifier with differential privacy.
This class implements regularised logistic regression using Scipy’s L-BFGS-B algorithm. \(\epsilon\)-Differential privacy is achieved relative to the maximum norm of the data, as determined by data_norm, by the Vector mechanism, which adds a Laplace-distributed random vector to the objective. Adapted from the work presented in [CMS11].
This class is a child of sklearn.linear_model.LogisticRegression, with amendments to allow for the implementation of differential privacy. Some parameters of Scikit Learn’s model have therefore had to be fixed, including:
The only permitted solver is ‘lbfgs’. Specifying the solver option will result in a warning.
Consequently, the only permitted penalty is ‘l2’. Specifying the penalty option will result in a warning.
In the multiclass case, only the one-vs-rest (OvR) scheme is permitted. Specifying the multi_class option will result in a warning.
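The idea of adding a random vector to the objective can be sketched as follows. The sketch assumes a noise vector with a uniformly random direction and a Gamma-distributed norm, and adds its dot product with the weights to the objective value. The function names and the scale value are hypothetical; in practice the scale would be derived from epsilon, the data norm and the number of samples, and the library's Vector mechanism may differ in detail.

```python
import math
import random


def sample_noise_vector(dimension, scale, rng):
    """Sample a vector with random direction and Gamma(dimension, scale) norm."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dimension)]
    length = math.sqrt(sum(d * d for d in direction))
    norm = rng.gammavariate(dimension, scale)
    return [norm * d / length for d in direction]


def perturb(objective, noise):
    """Wrap an objective so that the linear term noise . w is added to it."""
    def noisy_objective(w):
        return objective(w) + sum(b * wi for b, wi in zip(noise, w))
    return noisy_objective


def objective(w):
    """Stand-in for a regularised loss: the squared l2 norm of w."""
    return sum(wi * wi for wi in w)


rng = random.Random(0)
noise = sample_noise_vector(dimension=3, scale=0.5, rng=rng)
noisy = perturb(objective, noise)
```

Because the noise is drawn once and fixed, the perturbed objective stays convex and can be handed to an ordinary optimiser.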
- Parameters:
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
data_norm (float, optional) – The max l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy.
If not specified, the max norm is taken from the data when .fit() is first called, but will result in a PrivacyLeakWarning, as it reveals information about the data. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.
tol (float, default: 1e-4) – Tolerance for stopping criteria.
C (float, default: 1.0) – Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
fit_intercept (bool, default: True) – Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
max_iter (int, default: 100) – Maximum number of iterations taken for the solver to converge. For smaller epsilon (more noise), max_iter may need to be increased.
verbose (int, default: 0) – Set to any positive number for verbosity.
warm_start (bool, default: False) – When set to True, reuse the solution of the previous call to fit as initialization; otherwise, just erase the previous solution.
n_jobs (int, optional) – Number of CPU cores used when parallelising over classes. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
random_state (int or RandomState, optional) – Controls the randomness of the model. To obtain deterministic behaviour during randomisation, random_state has to be fixed to an integer.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
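To show what a bound on the l2 norm of each row means in practice, the sketch below rescales any row whose norm exceeds data_norm. The helper name clip_row_to_norm is hypothetical, and the sketch assumes that out-of-range rows are rescaled rather than rejected.

```python
import math


def clip_row_to_norm(row, data_norm):
    """Scale a row down so that its l2 norm does not exceed data_norm."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm <= data_norm:
        return list(row)
    return [v * data_norm / norm for v in row]


rows = [[3.0, 4.0], [0.3, 0.4]]  # l2 norms 5.0 and 0.5
clipped = [clip_row_to_norm(r, data_norm=1.0) for r in rows]
```

The first row is shrunk to norm 1 and the second is left unchanged. A data_norm that is too small distorts the data, while one that is too large adds more noise than necessary.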
- classes_
A list of class labels known to the classifier.
- Type:
array, shape (n_classes, )
- coef_
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
- Type:
array, shape (1, n_features) or (n_classes, n_features)
- intercept_
Intercept (a.k.a. bias) added to the decision function.
If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary.
- Type:
array, shape (1,) or (n_classes,)
- n_iter_
Actual number of iterations for all classes. If binary, it returns only 1 element.
- Type:
array, shape (n_classes,) or (1, )
Examples
>>> from sklearn.datasets import load_iris
>>> from diffprivlib.models import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(data_norm=12, epsilon=2).fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[7.35362932e-01, 2.16667422e-14, 2.64637068e-01],
       [9.08384378e-01, 3.47767052e-13, 9.16156215e-02]])
>>> clf.score(X, y)
0.5266666666666666
See also
sklearn.linear_model.LogisticRegression
The implementation of logistic regression in scikit-learn, upon which this implementation is built.
Vector
The mechanism used by the model to achieve differential privacy.
References
[CMS11] Chaudhuri, Kamalika, Claire Monteleoni, and Anand D. Sarwate. “Differentially private empirical risk minimization.” Journal of Machine Learning Research 12, no. Mar (2011): 1069-1109.
- decision_function(X)
Predict confidence scores for samples.
The confidence score for a sample is proportional to the signed distance of that sample to the hyperplane.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data matrix for which we want to get the confidence scores.
- Returns:
scores – Confidence scores per (n_samples, n_classes) combination. In the binary case, confidence score for self.classes_[1] where >0 means this class would be predicted.
- Return type:
ndarray of shape (n_samples,) or (n_samples, n_classes)
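For the binary case, the confidence score is the linear function w . x + b, and its sign decides the predicted class. A minimal sketch with hypothetical coefficients:

```python
def decision_function(x, coef, intercept):
    """Return w . x + b for a binary linear classifier."""
    return sum(w * xi for w, xi in zip(coef, x)) + intercept


classes = [0, 1]
coef, intercept = [2.0, -1.0], -0.5  # hypothetical fitted values
score = decision_function([1.0, 0.5], coef, intercept)
# A positive score means classes[1] is predicted.
predicted = classes[1] if score > 0 else classes[0]
```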
- densify()
Convert coefficient matrix to dense array format.
Converts the coef_ member (back) to a numpy.ndarray. This is the default format of coef_ and is required for fitting, so calling this method is only required on models that have previously been sparsified; otherwise, it is a no-op.
- Returns:
Fitted estimator.
- Return type:
self
- fit(X, y, sample_weight=None)
Fit the model according to the given training data.
- Parameters:
X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape (n_samples,)) – Target vector relative to X.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.
- Returns:
self
- Return type:
class
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- predict(X)
Predict class labels for samples in X.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data matrix for which we want to get the predictions.
- Returns:
y_pred – Vector containing the class labels for each sample.
- Return type:
ndarray of shape (n_samples,)
- predict_log_proba(X)
Predict logarithm of probability estimates.
The returned estimates for all classes are ordered by the label of classes.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
- Returns:
T – Returns the log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
- Return type:
array-like of shape (n_samples, n_classes)
- predict_proba(X)
Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
For a multiclass problem, if multi_class is set to “multinomial”, the softmax function is used to find the predicted probability of each class. Otherwise, a one-vs-rest approach is used, i.e. the probability of each class is calculated assuming it to be positive using the logistic function, and these values are normalized across all the classes.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
- Returns:
T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
- Return type:
array-like of shape (n_samples, n_classes)
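The one-vs-rest computation described above can be sketched in a few lines: apply the logistic function to each class score, then normalise so the values sum to one. The function name and the scores are hypothetical.

```python
import math


def ovr_probabilities(scores):
    """Apply the logistic function per class, then normalise to sum to 1."""
    raw = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(raw)
    return [p / total for p in raw]


# Hypothetical decision scores for one sample and three classes.
proba = ovr_probabilities([2.0, -1.0, 0.0])
```

Unlike softmax, each class is first scored independently, so the normalisation step is what makes the result a probability distribution.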
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of self.predict(X) w.r.t. y.
- Return type:
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LogisticRegression
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LogisticRegression
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- sparsify()
Convert coefficient matrix to sparse format.
Converts the coef_ member to a scipy.sparse matrix, which for L1-regularized models can be much more memory- and storage-efficient than the usual numpy.ndarray representation.
The intercept_ member is not converted.
- Returns:
Fitted estimator.
- Return type:
self
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call densify.
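The 50% rule of thumb can be checked with a small helper that computes the fraction of exactly-zero coefficients. The helper name and the coefficient values are hypothetical, and a plain nested list stands in for the coef_ array.

```python
def zero_fraction(coef):
    """Return the fraction of entries in a 2-D coefficient list that are zero."""
    entries = [value for row in coef for value in row]
    return sum(1 for value in entries if value == 0) / len(entries)


coef = [[0.0, 0.0, 1.5, 0.0], [0.0, -2.0, 0.0, 0.0]]
worth_sparsifying = zero_fraction(coef) > 0.5
```

Since this model only permits the ‘l2’ penalty, exact zeros in coef_ may be rare, in which case the check would advise against sparsifying.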
Tree-Based Models
- class diffprivlib.models.RandomForestClassifier(n_estimators=10, *, epsilon=1.0, bounds=None, classes=None, n_jobs=1, verbose=0, accountant=None, random_state=None, max_depth=5, warm_start=False, shuffle=False, **unused_args)
Random Forest Classifier with differential privacy.
This class implements Differentially Private Random Decision Forests as described in [1]. \(\epsilon\)-Differential privacy is achieved by constructing decision trees via a random splitting criterion and applying the PermuteAndFlip mechanism to determine a noisy label.
- Parameters:
n_estimators (int, default: 10) – The number of trees in the forest.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
classes (array-like of shape (n_classes,)) – Array of classes to be trained on. If not provided, the classes will be read from the data when .fit() is first called, resulting in a PrivacyLeakWarning.
n_jobs (int, default: 1) – Number of CPU cores used when parallelising over classes. -1 means using all processors.
verbose (int, default: 0) – Set to any positive number for verbosity.
random_state (int or RandomState, optional) – Controls both the randomness of the shuffling of the samples used when building trees (if shuffle=True) and training of the differentially-private DecisionTreeClassifier to construct the forest. To obtain deterministic behaviour during randomisation, random_state has to be fixed to an integer.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
max_depth (int, default: 5) – The maximum depth of the tree. The depth translates to an exponential increase in memory usage.
warm_start (bool, default=False) – When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
shuffle (bool, default=False) – When set to True, shuffles the datapoints at random before they are distributed among the trees. In diffprivlib, each datapoint is used to train exactly one tree. When set to False, datapoints are assigned to trees in sequence, in the order they appear.
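The statement that each datapoint trains exactly one tree can be pictured as a partition of the sample indices. The sketch below shows one possible assignment into contiguous, near-equal blocks, optionally after shuffling. The function name and the block boundaries are assumptions and need not match the library's exact split.

```python
import random


def assign_to_trees(n_samples, n_estimators, shuffle=False, seed=None):
    """Split sample indices into n_estimators disjoint groups, one per tree."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Contiguous blocks; every index lands in exactly one block.
    bounds = [round(i * n_samples / n_estimators) for i in range(n_estimators + 1)]
    return [indices[bounds[i]:bounds[i + 1]] for i in range(n_estimators)]


groups = assign_to_trees(n_samples=10, n_estimators=3)
```

Training each tree on a disjoint subset is the usual setting for parallel composition in differential privacy, where the overall privacy cost is governed by the largest per-subset cost rather than the sum.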
- estimator_
The child estimator template used to create the collection of fitted sub-estimators.
- Type:
- estimators_
The collection of fitted sub-estimators.
- Type:
- classes_
The class labels.
- Type:
ndarray of shape (n_classes,) or a list of such arrays
- feature_names_in_
Names of features seen during fit. Defined only when X has feature names that are all strings.
- Type:
ndarray of shape (n_features_in_,)
Examples
>>> from sklearn.datasets import make_classification
>>> from diffprivlib.models import RandomForestClassifier
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
References
[1] Sam Fletcher, Md Zahidul Islam. “Differentially Private Random Decision Forests using Smooth Sensitivity” https://arxiv.org/abs/1606.03572
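A minimal sketch of the permute-and-flip selection rule, which the class description names as the way a noisy label is chosen, is given below. It treats per-class counts in a leaf as utilities with sensitivity 1, which is an assumption; the function name is hypothetical and the library's PermuteAndFlip mechanism may be parameterised differently.

```python
import math
import random


def permute_and_flip(utilities, epsilon, sensitivity, rng):
    """Select an index by visiting candidates in random order and flipping a coin."""
    best = max(utilities)
    order = list(range(len(utilities)))
    rng.shuffle(order)
    for idx in order:
        # Acceptance probability exp(eps * (u - best) / (2 * sensitivity)).
        accept = math.exp(epsilon * (utilities[idx] - best) / (2.0 * sensitivity))
        if rng.random() <= accept:
            return idx
    raise AssertionError("unreachable: the best candidate is always accepted")


rng = random.Random(0)
counts = [12, 3, 1]  # hypothetical class counts in one leaf
label = permute_and_flip(counts, epsilon=1.0, sensitivity=1.0, rng=rng)
```

The candidate with the highest utility is accepted with probability 1 when reached, so the loop always returns. Larger epsilon makes the true majority class increasingly likely to be selected.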
- apply(X)
Apply trees in the forest to X, return leaf indices.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- Returns:
X_leaves – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.
- Return type:
ndarray of shape (n_samples, n_estimators)
- decision_path(X)
Return the decision path in the forest.
New in version 0.18.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- Returns:
indicator (sparse matrix of shape (n_samples, n_nodes)) – Return a node indicator matrix where nonzero elements indicate that the sample goes through the corresponding node. The matrix is in CSR format.
n_nodes_ptr (ndarray of shape (n_estimators + 1,)) – The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] give the indicator value for the i-th estimator.
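The role of n_nodes_ptr can be shown with plain lists: it marks where each tree's block of node columns starts and ends. The values below are hypothetical, and a single dense row stands in for one sample's row of the sparse indicator matrix.

```python
# One sample's indicator row: nodes of two trees laid out side by side.
indicator_row = [1, 1, 0, 1, 0, 1, 1]
# Tree 0 owns columns 0..2 and tree 1 owns columns 3..6.
n_nodes_ptr = [0, 3, 7]

per_tree = [
    indicator_row[n_nodes_ptr[i]:n_nodes_ptr[i + 1]]
    for i in range(len(n_nodes_ptr) - 1)
]
```

Each slice lists which nodes of the i-th tree the sample passes through.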
- property estimators_samples_
The subset of drawn samples for each base estimator.
Returns a dynamically generated list of indices identifying the samples used for fitting each member of the ensemble, i.e., the in-bag samples.
Note: the list is re-created at each call to the property in order to reduce the object memory footprint by not storing the sampling data. Thus fetching the property may be slower than expected.
- fit(X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
- Parameters:
X (array-like of shape (n_samples, n_features)) – The training input samples. Internally, its dtype will be converted to dtype=np.float32.
y (array-like of shape (n_samples,)) – The target values (class labels in classification, real numbers in regression).
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.
- Returns:
self – Fitted estimator.
- Return type:
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- predict(X)
Predict class for X.
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- Returns:
y – The predicted classes.
- Return type:
ndarray of shape (n_samples,) or (n_samples, n_outputs)
- predict_log_proba(X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample is computed as the log of the mean predicted class probabilities of the trees in the forest.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- Returns:
p – The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
- Return type:
ndarray of shape (n_samples, n_classes), or a list of such arrays
- predict_proba(X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- Returns:
p – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
- Return type:
ndarray of shape (n_samples, n_classes), or a list of such arrays
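The averaging described above can be sketched for a single sample: take the per-tree class probabilities, average them column by column, and predict the class with the highest mean. The function name and the probabilities are hypothetical.

```python
def forest_predict_proba(tree_probas):
    """Average per-tree class probabilities for one sample."""
    n_trees = len(tree_probas)
    return [sum(column) / n_trees for column in zip(*tree_probas)]


# Three trees, two classes: each inner list is one tree's estimate.
tree_probas = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
mean_proba = forest_predict_proba(tree_probas)
# Index of the class with the highest mean probability.
predicted_index = max(range(len(mean_proba)), key=mean_proba.__getitem__)
```

Taking the logarithm of each entry of mean_proba gives the quantity that predict_log_proba describes.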
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of self.predict(X) w.r.t. y.
- Return type:
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') RandomForestClassifier
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') RandomForestClassifier
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- class diffprivlib.models.DecisionTreeClassifier(max_depth=5, *, epsilon=1, bounds=None, classes=None, random_state=None, accountant=None, criterion=None, **unused_args)[source]
Decision Tree Classifier with differential privacy.
This class implements the base differentially private decision tree classifier for the Random Forest classifier algorithm. Not meant to be used separately.
- Parameters:
max_depth (int, default: 5) – The maximum depth of the tree.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
classes (array-like of shape (n_classes,), optional) – Array of class labels. If not provided, the classes will be read from the data when .fit() is first called, resulting in a PrivacyLeakWarning.
random_state (int or RandomState, optional) – Controls the randomness of the estimator. At each split, the feature to split on is chosen randomly, as is the threshold at which to split. The classification label at each leaf is then randomised, subject to differential privacy constraints. To obtain a deterministic behaviour during randomisation, random_state has to be fixed to an integer.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
- classes_
The class labels.
- Type:
array of shape (n_classes, )
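The random_state description says that both the split feature and the split threshold are chosen randomly. The sketch below shows one way such a data-independent split could be drawn from the bounds alone. It is an illustration under assumptions, not diffprivlib's implementation: random_split is a hypothetical helper, and the uniform distributions are assumed.

```python
import random


def random_split(bounds, rng):
    """Pick a split without looking at the data.

    bounds -- (mins, maxs): per-feature lower and upper bounds
    rng    -- a random.Random instance (seed it for reproducible splits)
    """
    mins, maxs = bounds
    feature = rng.randrange(len(mins))  # uniform over feature indices
    threshold = rng.uniform(mins[feature], maxs[feature])  # uniform within that feature's bounds
    return feature, threshold


# Two features with bounds [0, 10] and [-1, 1]; the seed fixes the outcome.
feature, threshold = random_split(([0.0, -1.0], [10.0, 1.0]), random.Random(0))
```

Because the split depends only on the bounds and the random generator, choosing it consumes no privacy budget in this sketch.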
- apply(X, check_input=True)
Return the index of the leaf that each sample is predicted as.
New in version 0.17.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
check_input (bool, default=True) – Allows bypassing several input checks. Don’t use this parameter unless you know what you’re doing.
- Returns:
X_leaves – For each datapoint x in X, return the index of the leaf x ends up in. Leaves are numbered within [0; self.tree_.node_count), possibly with gaps in the numbering.
- Return type:
array-like of shape (n_samples,)
- decision_path(X, check_input=True)
Return the decision path in the tree.
New in version 0.18.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
check_input (bool, default=True) – Allows bypassing several input checks. Don’t use this parameter unless you know what you’re doing.
- Returns:
indicator – Return a node indicator CSR matrix where nonzero elements indicate that the sample goes through the corresponding node.
- Return type:
sparse matrix of shape (n_samples, n_nodes)
- fit(X, y, sample_weight=None, check_input=True)[source]
Build a differentially-private decision tree classifier from the training set (X, y).
- Parameters:
X (array-like of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to dtype=np.float32.
y (array-like of shape (n_samples,)) – The target values (class labels) as integers or strings.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.
check_input (bool, default=True) – Allows bypassing several input checks. Don’t use this parameter unless you know what you’re doing.
- Returns:
self – Fitted estimator.
- Return type:
- get_depth()
Return the depth of the decision tree.
The depth of a tree is the maximum distance between the root and any leaf.
- Returns:
self.tree_.max_depth – The maximum depth of the tree.
- Return type:
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_n_leaves()
Return the number of leaves of the decision tree.
- Returns:
self.tree_.n_leaves – Number of leaves.
- Return type:
- get_params(deep=True)
Get parameters for this estimator.
- predict(X, check_input=True)
Predict class or regression value for X.
For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
check_input (bool, default=True) – Allows bypassing several input checks. Don’t use this parameter unless you know what you’re doing.
- Returns:
y – The predicted classes, or the predicted values.
- Return type:
array-like of shape (n_samples,) or (n_samples, n_outputs)
- predict_log_proba(X)[source]
Predict class log-probabilities of the input samples X.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
- Returns:
proba – The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
- Return type:
ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
- predict_proba(X, check_input=True)[source]
Predict class probabilities of the input samples X.
The predicted class probability is the fraction of samples of the same class in a leaf.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
check_input (bool, default=True) – Allows bypassing several input checks. Don’t use this parameter unless you know what you’re doing.
- Returns:
proba – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
- Return type:
ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
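The predict_proba description states that the predicted class probability is the fraction of samples of the same class in a leaf. The sketch below shows that arithmetic for a single leaf. leaf_class_fractions is a hypothetical helper, the uniform fallback for an empty leaf is an assumption, and the counts are plain (non-private) numbers used only to illustrate the formula.

```python
def leaf_class_fractions(counts):
    """Turn per-class counts in one leaf into probabilities.

    counts -- non-negative class counts, ordered as in classes_
    """
    total = sum(counts)
    if total == 0:
        # Assumption: an empty leaf falls back to a uniform distribution.
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]


# A leaf holding three samples of the first class and one of the second.
proba = leaf_class_fractions([3, 1])
```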
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of self.predict(X) w.r.t. y.
- Return type:
- set_fit_request(*, check_input: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') DecisionTreeClassifier
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- Returns:
self – The updated object.
- Return type:
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_predict_proba_request(*, check_input: bool | None | str = '$UNCHANGED$') DecisionTreeClassifier
Request metadata passed to the predict_proba method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_predict_request(*, check_input: bool | None | str = '$UNCHANGED$') DecisionTreeClassifier
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') DecisionTreeClassifier
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Regression models
Linear Regression
- class diffprivlib.models.LinearRegression(*, epsilon=1.0, bounds_X=None, bounds_y=None, fit_intercept=True, copy_X=True, random_state=None, accountant=None, **unused_args)[source]
Ordinary least squares Linear Regression with differential privacy.
LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Differential privacy is guaranteed with respect to the training sample.
Differential privacy is achieved by adding noise to the coefficients of the objective function, taking inspiration from [ZZX12].
- Parameters:
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds_X (tuple) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
bounds_y (tuple) – Same as bounds_X, but for the training label set y.
fit_intercept (bool, default: True) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
copy_X (bool, default: True) – If True, X will be copied; else, it may be overwritten.
random_state (int or RandomState, optional) – Controls the randomness of the model. To obtain a deterministic behaviour during randomisation, random_state has to be fixed to an integer.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
- coef_
Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.
- Type:
array of shape (n_features, ) or (n_targets, n_features)
- intercept_
Independent term in the linear model. Set to 0.0 if fit_intercept = False.
- Type:
float or array of shape of (n_targets,)
References
[ZZX12]Zhang, Jun, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, and Marianne Winslett. “Functional mechanism: regression analysis under differential privacy.” arXiv preprint arXiv:1208.0219 (2012).
- fit(X, y, sample_weight=None)[source]
Fit linear model.
- Parameters:
X (array-like or sparse matrix, shape (n_samples, n_features)) – Training data
y (array_like, shape (n_samples, n_targets)) – Target values. Will be cast to X’s dtype if necessary
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.
- Returns:
self
- Return type:
returns an instance of self.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- predict(X)
Predict using the linear model.
- Parameters:
X (array-like or sparse matrix, shape (n_samples, n_features)) – Samples.
- Returns:
C – Returns predicted values.
- Return type:
array, shape (n_samples,)
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – \(R^2\) of self.predict(X) w.r.t. y.
- Return type:
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
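The \(R^2\) definition used by score can be checked by hand on a small single-target example. The sketch below computes \(1 - \frac{u}{v}\) in plain Python; r2_score_simple is a hypothetical name, and the function assumes y_true is not constant (otherwise \(v\) is zero).

```python
def r2_score_simple(y_true, y_pred):
    """Coefficient of determination 1 - u/v for a single target."""
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - u / v


y = [1.0, 2.0, 3.0]
perfect = r2_score_simple(y, [1.0, 2.0, 3.0])   # 1.0: predictions match exactly
constant = r2_score_simple(y, [2.0, 2.0, 2.0])  # 0.0: always predicts the mean of y
worse = r2_score_simple(y, [3.0, 2.0, 1.0])     # -3.0: worse than the constant model
```

With differential privacy the fitted coefficients are noisy, so a negative score on small datasets or with small epsilon is not unusual.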
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LinearRegression
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LinearRegression
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Clustering models
K-Means
- class diffprivlib.models.KMeans(n_clusters=8, *, epsilon=1.0, bounds=None, random_state=None, accountant=None, **unused_args)[source]
K-Means clustering with differential privacy.
Implements the DPLloyd approach presented in [SCL16], leveraging the sklearn.cluster.KMeans class for full integration with Scikit Learn.
- Parameters:
n_clusters (int, default: 8) – The number of clusters to form as well as the number of centroids to generate.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
random_state (int or RandomState, optional) – Controls the randomness of the model. To obtain a deterministic behaviour during randomisation, random_state has to be fixed to an integer.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
- cluster_centers_
Coordinates of cluster centers. If the algorithm stops before fully converging, these will not be consistent with labels_.
- Type:
array, [n_clusters, n_features]
- labels_
Labels of each point
References
[SCL16]Su, Dong, Jianneng Cao, Ninghui Li, Elisa Bertino, and Hongxia Jin. “Differentially private k-means clustering.” In Proceedings of the sixth ACM conference on data and application security and privacy, pp. 26-37. ACM, 2016.
- fit(X, y=None, sample_weight=None)[source]
Computes k-means clustering with differential privacy.
- Parameters:
X (array-like, shape=(n_samples, n_features)) – Training instances to cluster.
y (Ignored) – not used, present here for API consistency by convention.
sample_weight (ignored) – Ignored by diffprivlib. Present for consistency with sklearn API.
- Returns:
self
- Return type:
class
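The DPLloyd approach named in the class description perturbs the per-cluster counts and coordinate sums in each Lloyd iteration. The sketch below shows a single such iteration on one-dimensional data using only the standard library. It is a simplified illustration under assumptions, not diffprivlib's implementation: the helper names, the even split of the iteration's epsilon between counts and sums, and the handling of near-empty clusters are all choices made for this example.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_lloyd_step(points, centers, bounds, epsilon, rng):
    """One noisy Lloyd iteration on 1-D data (illustrative only).

    points  -- list of floats, assumed to lie within bounds
    centers -- current cluster centres
    bounds  -- (lower, upper), chosen independently of the data
    epsilon -- privacy budget spent on this single iteration
    """
    lower, upper = bounds
    sum_sensitivity = max(abs(lower), abs(upper))  # one point changes a sum by at most this
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in points:
        nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        sums[nearest] += x
        counts[nearest] += 1

    new_centers = []
    for j, old in enumerate(centers):
        # Assumed split: half of the budget for the count, half for the sum.
        noisy_count = counts[j] + laplace(2.0 / epsilon, rng)
        noisy_sum = sums[j] + laplace(2.0 * sum_sensitivity / epsilon, rng)
        if noisy_count < 1.0:
            new_centers.append(old)  # too few (noisy) points: keep the old centre
        else:
            center = noisy_sum / noisy_count
            new_centers.append(min(max(center, lower), upper))  # clip back into bounds
    return new_centers


points = [0.1, 0.2, 0.15, 0.9, 0.95, 0.85]
centers = noisy_lloyd_step(points, [0.0, 1.0], (0.0, 1.0), 1.0, random.Random(0))
```

Running several iterations would divide the total epsilon across them, which is why the number of iterations trades accuracy per step against convergence.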
- fit_predict(X, y=None, sample_weight=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.
- Returns:
labels – Index of the cluster each sample belongs to.
- Return type:
ndarray of shape (n_samples,)
- fit_transform(X, y=None, sample_weight=None)
Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.
- Returns:
X_new – X transformed in the new space.
- Return type:
ndarray of shape (n_samples, n_clusters)
- get_feature_names_out(input_features=None)
Get output feature names for transformation.
The output feature names will be prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the output feature names are: ["class_name0", "class_name1", "class_name2"].
- Parameters:
input_features (array-like of str or None, default=None) – Only used to validate feature names with the names seen in fit.
- Returns:
feature_names_out – Transformed feature names.
- Return type:
ndarray of str objects
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- predict(X, sample_weight='deprecated')
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to predict.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.
Deprecated since version 1.3: The parameter sample_weight is deprecated in version 1.3 and will be removed in 1.5.
- Returns:
labels – Index of the cluster each sample belongs to.
- Return type:
ndarray of shape (n_samples,)
- score(X, y=None, sample_weight=None)
Opposite of the value of X on the K-means objective.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.
- Returns:
score – Opposite of the value of X on the K-means objective.
- Return type:
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') KMeans
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas"}, default=None) – Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
New in version 1.4: "polars" option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_predict_request(*, sample_weight: bool | None | str = '$UNCHANGED$') KMeans
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') KMeans
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
- Returns:
X_new – X transformed in the new space.
- Return type:
ndarray of shape (n_samples, n_clusters)
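The cluster-distance space produced by transform can be pictured as one Euclidean distance per sample and cluster centre. The sketch below computes that matrix with the standard library; cluster_distances is a hypothetical helper operating on plain tuples, whereas the estimator itself works on arrays and uses the fitted cluster_centers_.

```python
import math


def cluster_distances(X, centers):
    """Return a list of rows; row i holds the distance from X[i] to each centre."""
    return [[math.dist(x, c) for c in centers] for x in X]


# Two samples and two centres in the plane (a 3-4-5 triangle apart).
X_new = cluster_distances([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (3.0, 4.0)])
```

Each output row has n_clusters entries, matching the documented return shape (n_samples, n_clusters).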
Dimensionality reduction models
PCA
- class diffprivlib.models.PCA(n_components=None, *, epsilon=1.0, data_norm=None, centered=False, bounds=None, copy=True, whiten=False, random_state=None, accountant=None, **unused_args)[source]
Principal component analysis (PCA) with differential privacy.
This class is a child of sklearn.decomposition.PCA, with amendments to allow for the implementation of differential privacy as given in [IS16b]. Some parameters of Scikit Learn’s model have therefore had to be fixed, including:
The only permitted svd_solver is ‘full’. Specifying the svd_solver option will result in a warning;
The parameters tol and iterated_power are not applicable (as a consequence of fixing svd_solver = 'full').
- Parameters:
n_components (int, float, None or str) – Number of components to keep. If n_components is not set all components are kept:
n_components == min(n_samples, n_features)
If n_components == 'mle', Minka’s MLE is used to guess the dimension.
If 0 < n_components < 1, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
Hence, the None case results in:
n_components == min(n_samples, n_features) - 1
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\). If
centered=False
, half of epsilon is used to calculate the differentially private mean to center the data prior to the calculation of principal components.data_norm (float, optional) –
The max l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy.
If not specified, the max norm is taken from the data when
.fit()
is first called, but will result in aPrivacyLeakWarning
, as it reveals information about the data. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.centered (bool, default: False) –
If False, the data will be centered before calculating the principal components. This will be calculated with differential privacy, consuming privacy budget from epsilon.
If True, the data is assumed to have been centered previously (e.g. using
StandardScaler
), and therefore will not require the consumption of privacy budget to calculate the mean.bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when
.fit()
is first called, resulting in aPrivacyLeakWarning
.copy (bool, default: True) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
whiten (bool, default: False) –
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
random_state (int or RandomState, optional) – Controls the randomness of the model. To obtain a deterministic behaviour during randomisation, random_state has to be fixed to an integer.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
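As an illustration of the 0 < n_components < 1 rule described above, the following minimal sketch picks the smallest number of components whose cumulative share of variance is strictly greater than the requested fraction. The function name components_for_variance and the sample eigenvalues are hypothetical, and the sketch is non-private arithmetic only; it is not taken from the diffprivlib source.

```python
def components_for_variance(explained_variance, fraction):
    """Smallest k whose cumulative variance share exceeds `fraction`."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must lie strictly between 0 and 1")
    # Components are ranked by explained variance, largest first.
    ordered = sorted(explained_variance, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total > fraction:
            return k
    return len(ordered)


eigenvalues = [4.0, 3.0, 2.0, 1.0]  # illustrative values only
print(components_for_variance(eigenvalues, 0.65))  # 2, since (4 + 3) / 10 = 0.7 > 0.65
```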
- components_
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
- Type:
array, shape (n_components, n_features)
- explained_variance_
The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
- Type:
array, shape (n_components,)
- explained_variance_ratio_
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
- Type:
array, shape (n_components,)
- singular_values_
The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
- Type:
array, shape (n_components,)
- mean_
Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
- Type:
array, shape (n_features,)
- n_components_
The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.
- Type:
- noise_variance_
The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
- Type:
See also
sklearn.decomposition.PCA
Scikit-learn implementation of Principal Component Analysis.
References
[IS16b]Imtiaz, Hafiz, and Anand D. Sarwate. “Symmetric matrix perturbation for differentially-private principal component analysis.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2339-2343. IEEE, 2016.
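The attribute descriptions above relate explained_variance_, explained_variance_ratio_ and noise_variance_ to the eigenvalues of the covariance matrix. A minimal sketch of that relationship follows, assuming the eigenvalues are already available as a plain list. The helper variance_summary is hypothetical, and in diffprivlib the eigenvalues would come from a noisy covariance estimate that is not shown here.

```python
def variance_summary(eigenvalues, n_components):
    """Split eigenvalues into kept components, their ratios, and noise variance."""
    ordered = sorted(eigenvalues, reverse=True)
    kept = ordered[:n_components]   # the n_components largest eigenvalues
    rest = ordered[n_components:]   # the remaining, smallest eigenvalues
    total = sum(ordered)
    ratio = [value / total for value in kept]
    # Average of the discarded eigenvalues; zero when nothing is discarded.
    noise_variance = sum(rest) / len(rest) if rest else 0.0
    return kept, ratio, noise_variance


kept, ratio, noise = variance_summary([4.0, 3.0, 2.0, 1.0], n_components=2)
print(kept, ratio, noise)
```

When every component is kept, the ratios sum to 1.0 and no eigenvalues are left over to average.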
- fit(X, y=None)[source]
Fit the model with X.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Ignored.
- Returns:
self – Returns the instance itself.
- Return type:
- fit_transform(X, y=None)[source]
Fit the model with X and apply the dimensionality reduction on X.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Ignored.
- Returns:
X_new – Transformed values.
- Return type:
ndarray of shape (n_samples, n_components)
Notes
This method returns a Fortran-ordered array. To convert it to a C-ordered array, use ‘np.ascontiguousarray’.
- get_covariance()
Compute data covariance with the generative model.
cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)
where S**2 contains the explained variances, and sigma2 contains the noise variances.
- Returns:
cov – Estimated covariance of data.
- Return type:
array of shape=(n_features, n_features)
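The covariance formula quoted above can be transcribed literally with nested lists. This is only a sketch of the formula as written in this reference; the function name model_covariance is hypothetical, and the library's own implementation may apply further adjustments that are not shown.

```python
def model_covariance(components, explained_variance, noise_variance):
    """cov = components.T * S**2 * components + sigma2 * eye(n_features)."""
    n_features = len(components[0])
    cov = [[0.0] * n_features for _ in range(n_features)]
    # Add the contribution of each principal axis, scaled by its variance.
    for axis, s2 in zip(components, explained_variance):
        for i in range(n_features):
            for j in range(n_features):
                cov[i][j] += axis[i] * s2 * axis[j]
    # Add the isotropic noise term on the diagonal.
    for i in range(n_features):
        cov[i][i] += noise_variance
    return cov


# One component along the first of two features (illustrative values).
print(model_covariance([[1.0, 0.0]], [4.0], 1.0))  # [[5.0, 0.0], [0.0, 1.0]]
```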
- get_feature_names_out(input_features=None)
Get output feature names for transformation.
The output feature names will be prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the output feature names are: [“class_name0”, “class_name1”, “class_name2”].
- Parameters:
input_features (array-like of str or None, default=None) – Only used to validate feature names with the names seen in fit.
- Returns:
feature_names_out – Transformed feature names.
- Return type:
ndarray of str objects
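The naming rule described above can be sketched in a few lines. The helper feature_names_out is hypothetical and returns a plain list rather than an ndarray; it only mirrors the documented pattern of a lowercased class name followed by an index.

```python
def feature_names_out(class_name, n_components):
    """Build names such as 'pca0', 'pca1', ... from a class name."""
    prefix = class_name.lower()
    return [f"{prefix}{i}" for i in range(n_components)]


print(feature_names_out("PCA", 3))  # ['pca0', 'pca1', 'pca2']
```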
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- get_precision()
Compute data precision matrix with the generative model.
Equals the inverse of the covariance but computed with the matrix inversion lemma for efficiency.
- Returns:
precision – Estimated precision of data.
- Return type:
array, shape=(n_features, n_features)
- inverse_transform(X)
Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
- Parameters:
X (array-like of shape (n_samples, n_components)) – New data, where n_samples is the number of samples and n_components is the number of components.
- Returns:
Original data, where n_samples is the number of samples and n_features is the number of features.
- Return type:
X_original array-like of shape (n_samples, n_features)
Notes
If whitening is enabled, inverse_transform will compute the exact inverse operation, which includes reversing whitening.
- score(X, y=None)[source]
Return the average log-likelihood of all samples.
See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data.
y (Ignored) – Ignored.
- Returns:
ll – Average log-likelihood of the samples under the current model.
- Return type:
- score_samples(X)[source]
Return the log-likelihood of each sample.
See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data.
- Returns:
ll – Log-likelihood of each sample under the current model.
- Return type:
ndarray of shape (n_samples,)
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
“polars”: Polars output
None: Transform configuration is unchanged
New in version 1.4: “polars” option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- transform(X)
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data, where n_samples is the number of samples and n_features is the number of features.
- Returns:
X_new – Projection of X in the first principal components, where n_samples is the number of samples and n_components is the number of the components.
- Return type:
array-like of shape (n_samples, n_components)
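The projection described above can be sketched with plain lists, assuming the standard PCA convention that each row is centered with mean_ and then dotted with each row of components_. The function name project is hypothetical, and whitening is deliberately left out.

```python
def project(rows, mean, components):
    """Center each row and project it onto the given principal axes."""
    projected = []
    for row in rows:
        centered = [x - m for x, m in zip(row, mean)]
        # One output coordinate per principal axis (dot product).
        projected.append(
            [sum(c * a for c, a in zip(centered, axis)) for axis in components]
        )
    return projected


# Two features reduced to one component along the first axis.
print(project([[3.0, 4.0]], mean=[1.0, 1.0], components=[[1.0, 0.0]]))  # [[2.0]]
```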
Preprocessing
Standard Scaler
- class diffprivlib.models.StandardScaler(*, epsilon=1.0, bounds=None, copy=True, with_mean=True, with_std=True, random_state=None, accountant=None)[source]
Standardize features by removing the mean and scaling to unit variance, calculated with differential privacy guarantees. Differential privacy is guaranteed on the learned scaler with respect to the training sample; the transformed output will certainly not satisfy differential privacy.
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the (differentially private) mean of the training samples or zero if with_mean=False, and s is the (differentially private) standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.
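A minimal sketch of the standard score and its inverse for a single value, assuming u and s have already been learned. The helper names standardize and unstandardize are hypothetical, and in diffprivlib u and s would be the differentially private estimates rather than the exact statistics used here.

```python
def standardize(x, u, s, with_mean=True, with_std=True):
    """z = (x - u) / s, with u treated as 0 or s as 1 when disabled."""
    shift = u if with_mean else 0.0
    scale = s if with_std else 1.0
    return (x - shift) / scale


def unstandardize(z, u, s, with_mean=True, with_std=True):
    """Inverse of standardize: x = z * s + u."""
    shift = u if with_mean else 0.0
    scale = s if with_std else 1.0
    return z * scale + shift


z = standardize(7.0, u=5.0, s=2.0)
print(z)                                 # 1.0
print(unstandardize(z, u=5.0, s=2.0))    # 7.0
```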
For further information, users are referred to sklearn.preprocessing.StandardScaler.
- Parameters:
epsilon (float, default: 1.0) – The privacy budget to be allocated to learning the mean and variance of the training sample. If with_std=True, the privacy budget is split evenly between mean and variance (the mean must be calculated even when with_mean=False, as it is used in the calculation of the variance).
bounds (tuple, optional) – Bounds of the data, provided as a tuple of the form (min, max). min and max can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when .fit() is first called, resulting in a PrivacyLeakWarning.
copy (boolean, default: True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array, a copy may still be returned.
with_mean (boolean, True by default) – If True, center the data before scaling.
with_std (boolean, True by default) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
random_state (int or RandomState, optional) – Controls the randomness of the model. To obtain a deterministic behaviour during randomisation, random_state has to be fixed to an integer.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
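To illustrate why epsilon and bounds matter, the sketch below splits a budget evenly as described for with_std=True and computes a bounded mean with generic Laplace noise. All function names are hypothetical. The Laplace mechanism shown is a textbook construction chosen for illustration, and the assumption that the whole budget goes to the mean when with_std=False is inferred from the description above; diffprivlib's internal mechanisms may differ.

```python
import random


def split_budget(epsilon, with_std=True):
    """Return (eps_mean, eps_var) using the documented even split."""
    if with_std:
        return epsilon / 2, epsilon / 2
    return epsilon, 0.0  # assumption: the mean receives the whole budget


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean(values, bounds, epsilon, rng):
    """Clip values to the bounds, then add noise scaled to the mean's sensitivity."""
    lower, upper = bounds
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the bounded mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


eps_mean, eps_var = split_budget(1.0)
rng = random.Random(0)  # fixed seed for a reproducible illustration
print(private_mean([1.0, 2.0, 3.0, 4.0], bounds=(0.0, 10.0), epsilon=eps_mean, rng=rng))
```

Wider bounds or a smaller epsilon both increase the noise scale, which is why bounds chosen from domain knowledge are preferable to loose defaults.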
- scale_
Per feature relative scaling of the data. This is calculated using np.sqrt(var_). Equal to None when with_std=False.
- Type:
ndarray or None, shape (n_features,)
- mean_
The mean value for each feature in the training set. Equal to None when with_mean=False.
- Type:
ndarray or None, shape (n_features,)
- var_
The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
- Type:
ndarray or None, shape (n_features,)
- n_samples_seen_
The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer, otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.
- Type:
int or array, shape (n_features,)
See also
sklearn.preprocessing.StandardScaler
Vanilla scikit-learn version, without differential privacy.
PCA
Further removes the linear correlation across features with ‘whiten=True’.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
- fit(X, y=None, sample_weight=None)[source]
Compute the mean and std to be used for later scaling.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
sample_weight (array-like of shape (n_samples,), default=None) –
Individual weights for each sample.
New in version 0.24: parameter sample_weight support to StandardScaler.
- Returns:
self – Fitted scaler.
- Return type:
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray of shape (n_samples, n_features_new)
- get_feature_names_out(input_features=None)
Get output feature names for transformation.
- Parameters:
input_features (array-like of str or None, default=None) –
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns:
feature_names_out – Same as input features.
- Return type:
ndarray of str objects
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- inverse_transform(X, copy=None)[source]
Scale back the data to the original representation.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
copy (bool, default=None) – Copy the input X or not.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_features)
- partial_fit(X, y=None, sample_weight=None)[source]
Online computation of mean and std with differential privacy on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible due to very large number of n_samples or because X is read from a continuous stream.
The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.
- Parameters:
X ({array-like}, shape [n_samples, n_features]) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y – Ignored
sample_weight – Ignored by diffprivlib. Present for consistency with sklearn API.
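The incremental update cited above (Chan, Golub and LeVeque, Equations 1.5a,b) merges the running sum and sum of squared deviations of two batches. The sketch below shows only that non-private arithmetic for a single feature. The function names are hypothetical, and the noise that diffprivlib adds for differential privacy is not shown.

```python
def batch_stats(values):
    """Return (count, sum, sum of squared deviations) for one batch."""
    n = len(values)
    total = sum(values)
    mean = total / n
    return n, total, sum((v - mean) ** 2 for v in values)


def merge_batches(m, t_a, s_a, n, t_b, s_b):
    """Combine two batches: m, n are counts; t_* sums; s_* squared deviations."""
    if m == 0:
        return n, t_b, s_b
    if n == 0:
        return m, t_a, s_a
    t = t_a + t_b                                                         # Eq. 1.5a
    s = s_a + s_b + (m / (n * (m + n))) * ((n / m) * t_a - t_b) ** 2      # Eq. 1.5b
    return m + n, t, s


count, total, sq_dev = merge_batches(*batch_stats([0.0, 2.0]), *batch_stats([4.0]))
print(count, total / count, sq_dev / count)  # count, mean, population variance
```

Merging batch by batch gives the same statistics as a single pass over all the data, up to floating-point rounding, which is what makes the update suitable for streamed input.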
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') StandardScaler
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_inverse_transform_request(*, copy: bool | None | str = '$UNCHANGED$') StandardScaler
Request metadata passed to the inverse_transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
“polars”: Polars output
None: Transform configuration is unchanged
New in version 1.4: “polars” option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_partial_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') StandardScaler
Request metadata passed to the partial_fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') StandardScaler
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- transform(X, copy=None)[source]
Perform standardization by centering and scaling.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
copy (bool, default=None) – Copy the input X or not.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_features)