diffprivlib.tools
Tools for data analysis with differential privacy.
Histogram functions

diffprivlib.tools.histogram(sample, epsilon=1.0, bins=10, range=None, weights=None, density=None, accountant=None, **unused_args)
Compute the differentially private histogram of a set of data.
The histogram is computed using numpy.histogram, and noise is added using GeometricTruncated to satisfy differential privacy. If the range parameter is not specified correctly, a PrivacyLeakWarning is thrown. Users are referred to numpy.histogram for more usage notes.
Parameters
sample (array_like) – Input data. The histogram is computed over the flattened array.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) to be applied.
bins (int or sequence of scalars or str, default: 10) –
If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines a monotonically increasing array of bin edges, including the rightmost edge, allowing for non-uniform bin widths.
If bins is a string, it defines the method used to calculate the optimal bin width, as defined by histogram_bin_edges.
range ((float, float), optional) – The lower and upper range of the bins. If not provided, range is simply (sample.min(), sample.max()). Values outside the range are ignored. The first element of the range must be less than or equal to the second. range affects the automatic bin computation as well. While bin width is computed to be optimal based on the actual data within range, the bin count will fill the entire range including portions containing no data.
weights (array_like, optional) – An array of weights, of the same shape as sample. Each value in sample only contributes its associated weight towards the bin count (instead of 1). If density is True, the weights are normalized, so that the integral of the density over the range remains 1.
density (bool, optional) – If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
Returns
hist (array) – The values of the histogram. See density and weights for a description of the possible semantics.
bin_edges (array of dtype float) – Return the bin edges (length(hist)+1).
See also
Notes
All but the last (righthand-most) bin is half-open. In other words, if bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
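The bin-edge convention in these notes comes from numpy.histogram, so it can be checked on the underlying non-private function. The sketch below uses plain NumPy only, adds no noise, and the data values are made up for illustration.

```python
import numpy as np

# Each value sits exactly on one of the bin edges [1, 2, 3, 4].
data = np.array([1, 2, 3, 4])
counts, edges = np.histogram(data, bins=[1, 2, 3, 4])

# 1 falls in [1, 2), 2 falls in [2, 3), and both 3 and 4 fall in [3, 4].
print(counts)  # [1 1 2]
print(edges)   # [1 2 3 4]
```

With the differentially private version, the returned counts would additionally carry noise, so they would generally differ from these exact values.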

diffprivlib.tools.histogramdd(sample, epsilon=1.0, bins=10, range=None, weights=None, density=None, accountant=None, **unused_args)
Compute the differentially private multidimensional histogram of some data.
The histogram is computed using numpy.histogramdd, and noise is added using GeometricTruncated to satisfy differential privacy. If the range parameter is not specified correctly, a PrivacyLeakWarning is thrown. Users are referred to numpy.histogramdd for more usage notes.
Parameters
sample ((N, D) array, or (D, N) array_like) –
The data to be histogrammed.
Note the unusual interpretation of sample when an array_like:
When an array, each row is a coordinate in a D-dimensional space, such as histogramdd(np.array([p1, p2, p3])).
When an array_like, each element is the list of values for a single coordinate, such as histogramdd((X, Y, Z)).
The first form should be preferred.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) to be applied.
bins (sequence or int, default: 10) –
The bin specification:
A sequence of arrays describing the monotonically increasing bin edges along each dimension.
The number of bins for each dimension (nx, ny, … =bins)
The number of bins for all dimensions (nx=ny=…=bins).
range (sequence, optional) – A sequence of length D, each an optional (lower, upper) tuple giving the outer bin edges to be used if the edges are not given explicitly in bins. An entry of None in the sequence results in the minimum and maximum values being used for the corresponding dimension. The default, None, is equivalent to passing a tuple of D None values.
density (bool, optional) – If False, the default, returns the number of samples in each bin. If True, returns the probability density function at the bin, bin_count / sample_count / bin_volume.
weights ((N,) array_like, optional) – An array of values w_i weighing each sample (x_i, y_i, z_i, …). Weights are normalized to 1 if density is True. If density is False, the values of the returned histogram are equal to the sum of the weights belonging to the samples falling into each bin.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
Returns
H (ndarray) – The multidimensional histogram of sample. See density and weights for the different possible semantics.
edges (list) – A list of D arrays describing the bin edges for each dimension.
See also
histogram()
1D differentially private histogram
histogram2d()
2D differentially private histogram
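The two accepted layouts of sample can be compared on numpy.histogramdd, which this function is documented as using. This is a sketch in plain NumPy with no noise added; the names X, Y, Z and unit_cube are illustrative, not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, Z = rng.random(50), rng.random(50), rng.random(50)
unit_cube = [(0, 1)] * 3  # explicit range, chosen independently of the data

# Preferred form: an (N, D) array with one row per point.
points = np.column_stack([X, Y, Z])
H_rows, _ = np.histogramdd(points, bins=4, range=unit_cube)

# array_like form: one sequence of values per coordinate.
H_cols, _ = np.histogramdd((X, Y, Z), bins=4, range=unit_cube)

print(H_rows.shape)                    # (4, 4, 4)
print(np.array_equal(H_rows, H_cols))  # True
```

Both layouts describe the same 50 points, so the exact histograms agree. Two differentially private calls would not be expected to match element for element, since each draws fresh noise.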

diffprivlib.tools.histogram2d(array_x, array_y, epsilon=1.0, bins=10, range=None, weights=None, density=None, accountant=None, **unused_args)
Compute the differentially private bidimensional histogram of two data samples.
Parameters
array_x (array_like, shape (N,)) – An array containing the x coordinates of the points to be histogrammed.
array_y (array_like, shape (N,)) – An array containing the y coordinates of the points to be histogrammed.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) to be applied.
bins (int or array_like or [int, int] or [array, array], default: 10) –
The bin specification:
If int, the number of bins for the two dimensions (nx=ny=bins).
If array_like, the bin edges for the two dimensions (x_edges=y_edges=bins).
If [int, int], the number of bins in each dimension (nx, ny = bins).
If [array, array], the bin edges in each dimension (x_edges, y_edges = bins).
A combination [int, array] or [array, int], where int is the number of bins and array is the bin edges.
range (array_like, shape(2,2), optional) – The leftmost and rightmost edges of the bins along each dimension (if not specified explicitly in the bins parameters): [[xmin, xmax], [ymin, ymax]]. All values outside of this range will be considered outliers and not tallied in the histogram.
density (bool, optional) – If False, the default, returns the number of samples in each bin. If True, returns the probability density function at the bin, bin_count / sample_count / bin_area.
weights (array_like, shape(N,), optional) – An array of values w_i weighing each sample (x_i, y_i). Weights are normalized to 1 if density is True. If density is False, the values of the returned histogram are equal to the sum of the weights belonging to the samples falling into each bin.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
 Returns
H (ndarray, shape(nx, ny)) – The bidimensional histogram of samples x and y. Values in x are histogrammed along the first dimension and values in y are histogrammed along the second dimension.
xedges (ndarray, shape(nx+1,)) – The bin edges along the first dimension.
yedges (ndarray, shape(ny+1,)) – The bin edges along the second dimension.
See also
histogram()
1D differentially private histogram
histogramdd()
Differentially private Multidimensional histogram
Notes
When density is True, then the returned histogram is the sample density, defined such that the sum over bins of the product bin_value * bin_area is 1.
Please note that the histogram does not follow the Cartesian convention where x values are on the abscissa and y values on the ordinate axis. Rather, x is histogrammed along the first dimension of the array (vertical), and y along the second dimension of the array (horizontal). This ensures compatibility with histogramdd.
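The axis convention in these notes is easiest to see on a tiny input. The sketch below uses the non-private numpy.histogram2d, on the assumption that the differentially private version keeps the same array layout; the data points are made up.

```python
import numpy as np

# Three points that share y = 0.5 but differ in x.
x = np.array([0.5, 1.5, 2.5])
y = np.array([0.5, 0.5, 0.5])

H, xedges, yedges = np.histogram2d(x, y, bins=[3, 2], range=[[0, 3], [0, 2]])

print(H.shape)  # (3, 2): nx along the first dimension, ny along the second
print(H[:, 0])  # [1. 1. 1.]: the x values spread down the rows
print(H[:, 1])  # [0. 0. 0.]
```

When plotting such an array with a tool that draws rows vertically, transposing H first gives the usual Cartesian orientation.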
General Utilities

diffprivlib.tools.count_nonzero(array, epsilon=1.0, accountant=None, axis=None, keepdims=False)
Counts the number of non-zero values in the array array with differential privacy.
The word “non-zero” is in reference to the Python 2.x built-in method __nonzero__() (renamed __bool__() in Python 3.x) of Python objects that tests an object’s “truthfulness”. For example, any number is considered truthful if it is nonzero, whereas any string is considered truthful if it is not the empty string. Thus, this function (recursively) counts how many elements in array (and in sub-arrays thereof) have their __nonzero__() or __bool__() method evaluated to True.
Parameters
array (array_like) – The array for which to count nonzeros.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
axis (int or tuple, optional) – Axis or tuple of axes along which to count non-zeros. Default is None, meaning that non-zeros will be counted along a flattened version of array.
keepdims (bool, optional) – If this is set to True, the axes that are counted are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
 Returns
count – Differentially private number of nonzero values in the array along a given axis. Otherwise, the total number of nonzero values in the array is returned.
 Return type
int or array of int
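The axis and keepdims semantics can be illustrated with the non-private numpy.count_nonzero. This sketch assumes the differentially private version follows the same shape rules and only perturbs the counts; the array is made up.

```python
import numpy as np

arr = np.array([[0, 1, 7, 0],
                [3, 0, 2, 19]])

total = np.count_nonzero(arr)                          # flattened count
per_column = np.count_nonzero(arr, axis=0)             # shape (4,)
per_row = np.count_nonzero(arr, axis=1, keepdims=True) # shape (2, 1)

print(total)       # 5
print(per_column)  # [1 1 2 1]
print(per_row)     # [[2], [3]] as a (2, 1) array
```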

diffprivlib.tools.mean(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)
Compute the differentially private arithmetic mean along the specified axis.
Returns the average of the array elements with differential privacy. The average is taken over the flattened array by default, otherwise over the specified axis. Noise is added using Laplace to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.mean for further details, as the behaviour of mean closely follows its NumPy variant.
Parameters
array (array_like) – Array containing numbers whose mean is desired. If array is not an array, a conversion is attempted.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).
axis (int or tuple of ints, optional) –
Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.
If this is a tuple of ints, a mean is performed over multiple axes, instead of a single axis or all the axes as before.
dtype (data-type, optional) – Type to use in computing the mean. For integer inputs, the default is float64; for floating point inputs, it is the same as the input dtype.
keepdims (bool, optional) –
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the mean method of subclasses of ndarray, however any non-default value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
 Returns
m – Returns a new array containing the mean values.
 Return type
ndarray, see dtype parameter above
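To make the role of bounds concrete, here is a minimal sketch of a Laplace-mechanism mean written with plain NumPy. It is not diffprivlib's implementation: the helper name laplace_mean_sketch is hypothetical, the dataset size is assumed to be public, and the sensitivity (upper - lower) / n is the textbook value for replacing one record of a clipped dataset.

```python
import numpy as np

def laplace_mean_sketch(values, epsilon, bounds, rng):
    """Illustrative Laplace-mechanism mean (assumption: n is public)."""
    lower, upper = bounds
    # Clip so that every record provably lies inside the stated bounds.
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    # Replacing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / clipped.size
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=10_000)
private_mean = laplace_mean_sketch(ages, epsilon=1.0, bounds=(18, 90), rng=rng)
print(private_mean)
```

Wider bounds or a smaller epsilon increase the noise scale, which is why bounds should be as tight as domain knowledge allows while still being chosen independently of the data.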

diffprivlib.tools.nanmean(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)
Compute the differentially private arithmetic mean along the specified axis, ignoring NaNs.
Returns the average of the array elements with differential privacy. The average is taken over the flattened array by default, otherwise over the specified axis. Noise is added using Laplace to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.mean for further details, as the behaviour of mean closely follows its NumPy variant.
For all-NaN slices, NaN is returned and a RuntimeWarning is raised.
Parameters
array (array_like) – Array containing numbers whose mean is desired. If array is not an array, a conversion is attempted.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).
axis (int or tuple of ints, optional) –
Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.
If this is a tuple of ints, a mean is performed over multiple axes, instead of a single axis or all the axes as before.
dtype (data-type, optional) – Type to use in computing the mean. For integer inputs, the default is float64; for floating point inputs, it is the same as the input dtype.
keepdims (bool, optional) –
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the mean method of subclasses of ndarray, however any non-default value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
 Returns
m – Returns a new array containing the mean values.
 Return type
ndarray, see dtype parameter above
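The all-NaN behaviour described for this function matches that of the non-private numpy.nanmean, which the sketch below uses for illustration. No noise is added here, and the array is made up.

```python
import warnings
import numpy as np

data = np.array([[1.0, np.nan, 3.0],
                 [np.nan, np.nan, np.nan]])

# Record warnings so the RuntimeWarning for the all-NaN row can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    row_means = np.nanmean(data, axis=1)

print(row_means)                               # first row averages 1.0 and 3.0
print([w.category.__name__ for w in caught])   # includes 'RuntimeWarning'
```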

diffprivlib.tools.std(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)
Compute the standard deviation along the specified axis.
Returns the standard deviation of the array elements, a measure of the spread of a distribution, with differential privacy. The standard deviation is computed for the flattened array by default, otherwise over the specified axis. Noise is added using LaplaceBoundedDomain to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.std for further details, as the behaviour of std closely follows its NumPy variant.
Parameters
array (array_like) – Calculate the standard deviation of these values.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).
axis (int or tuple of ints, optional) –
Axis or axes along which the standard deviation is computed. The default is to compute the standard deviation of the flattened array.
If this is a tuple of ints, a standard deviation is performed over multiple axes, instead of a single axis or all the axes as before.
dtype (dtype, optional) – Type to use in computing the standard deviation. For arrays of integer type the default is float64, for arrays of float types it is the same as the array type.
keepdims (bool, optional) –
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the std method of subclasses of ndarray, however any non-default value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
 Returns
standard_deviation – Return a new array containing the standard deviation.
 Return type
ndarray, see dtype parameter above.
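The remark that a keepdims=True result "will broadcast correctly against the input array" is easiest to see when reducing over the last axis. The sketch below uses the non-private numpy.std and numpy.mean to standardise each row; it assumes the differentially private functions return arrays of the same shapes.

```python
import numpy as np

data = np.arange(12, dtype=float).reshape(3, 4)

row_mean = np.mean(data, axis=1, keepdims=True)  # shape (3, 1)
row_std = np.std(data, axis=1, keepdims=True)    # shape (3, 1)

# The retained size-one axis lets the statistics broadcast against (3, 4).
standardised = (data - row_mean) / row_std

print(row_std.shape, standardised.shape)  # (3, 1) (3, 4)
```

Without keepdims the statistics would have shape (3,), which does not broadcast against a (3, 4) array.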

diffprivlib.tools.nanstd(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)
Compute the standard deviation along the specified axis, ignoring NaNs.
Returns the standard deviation of the array elements, a measure of the spread of a distribution, with differential privacy. The standard deviation is computed for the flattened array by default, otherwise over the specified axis. Noise is added using LaplaceBoundedDomain to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.std for further details, as the behaviour of std closely follows its NumPy variant.
For all-NaN slices, NaN is returned and a RuntimeWarning is raised.
Parameters
array (array_like) – Calculate the standard deviation of these values.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).
axis (int or tuple of ints, optional) –
Axis or axes along which the standard deviation is computed. The default is to compute the standard deviation of the flattened array.
If this is a tuple of ints, a standard deviation is performed over multiple axes, instead of a single axis or all the axes as before.
dtype (dtype, optional) – Type to use in computing the standard deviation. For arrays of integer type the default is float64, for arrays of float types it is the same as the array type.
keepdims (bool, optional) –
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the std method of subclasses of ndarray, however any non-default value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
 Returns
standard_deviation – Return a new array containing the standard deviation.
 Return type
ndarray, see dtype parameter above.

diffprivlib.tools.sum(array, epsilon=1.0, bounds=None, accountant=None, axis=None, dtype=None, keepdims=<no value>, **unused_args)
Sum of array elements over a given axis with differential privacy.
Parameters
array (array_like) – Elements to sum.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
axis (None or int or tuple of ints, optional) –
Axis or axes along which a sum is performed. The default, axis=None, will sum all of the elements of the input array. If axis is negative it counts from the last to the first axis.
If axis is a tuple of ints, a sum is performed on all of the axes specified in the tuple instead of a single axis or all the axes as before.
dtype (dtype, optional) – The type of the returned array and of the accumulator in which the elements are summed. The dtype of array is used by default unless array has an integer dtype of less precision than the default platform integer. In that case, if array is signed then the platform integer is used while if array is unsigned then an unsigned integer of the same precision as the platform integer is used.
keepdims (bool, optional) –
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the sum method of subclasses of ndarray, however any non-default value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
Returns
sum_along_axis – An array with the same shape as array, with the specified axis removed. If array is a 0-d array, or if axis is None, a scalar is returned.
 Return type
ndarray
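The dtype rule for small integer inputs is inherited from numpy.sum and can be seen on the non-private function. The sketch below is plain NumPy with made-up values; it only illustrates the accumulator behaviour, not the added noise.

```python
import numpy as np

small = np.array([100, 100, 100], dtype=np.int8)

# Default: the accumulator is promoted to the platform integer.
total = np.sum(small)

# Forcing an 8-bit accumulator makes the sum wrap around.
wrapped = np.sum(small, dtype=np.int8)

print(total, total.dtype)  # 300 and the platform integer dtype
print(wrapped)             # 44, i.e. 300 modulo 256
```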

diffprivlib.tools.nansum(array, epsilon=1.0, bounds=None, accountant=None, axis=None, dtype=None, keepdims=<no value>, **unused_args)
Sum of array elements over a given axis with differential privacy, ignoring NaNs.
Parameters
array (array_like) – Elements to sum.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
axis (None or int or tuple of ints, optional) –
Axis or axes along which a sum is performed. The default, axis=None, will sum all of the elements of the input array. If axis is negative it counts from the last to the first axis.
If axis is a tuple of ints, a sum is performed on all of the axes specified in the tuple instead of a single axis or all the axes as before.
dtype (dtype, optional) – The type of the returned array and of the accumulator in which the elements are summed. The dtype of array is used by default unless array has an integer dtype of less precision than the default platform integer. In that case, if array is signed then the platform integer is used while if array is unsigned then an unsigned integer of the same precision as the platform integer is used.
keepdims (bool, optional) –
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the sum method of subclasses of ndarray, however any non-default value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
Returns
sum_along_axis – An array with the same shape as array, with the specified axis removed. If array is a 0-d array, or if axis is None, a scalar is returned.
 Return type
ndarray

diffprivlib.tools.var(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)
Compute the differentially private variance along the specified axis.
Returns the variance of the array elements, a measure of the spread of a distribution, with differential privacy. The variance is computed for the flattened array by default, otherwise over the specified axis. Noise is added using LaplaceBoundedDomain to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.var for further details, as the behaviour of var closely follows its NumPy variant.
Parameters
array (array_like) – Array containing numbers whose variance is desired. If array is not an array, a conversion is attempted.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).
axis (int or tuple of ints, optional) –
Axis or axes along which the variance is computed. The default is to compute the variance of the flattened array.
If this is a tuple of ints, a variance is performed over multiple axes, instead of a single axis or all the axes as before.
dtype (data-type, optional) – Type to use in computing the variance. For arrays of integer type the default is float64; for arrays of float types it is the same as the array type.
keepdims (bool, optional) –
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the var method of subclasses of ndarray, however any non-default value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
 Returns
variance – Returns a new array containing the variance.
 Return type
ndarray, see dtype parameter above
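Passing a tuple as axis reduces over several axes at once. The sketch below shows this with the non-private numpy.var, assuming the differentially private version accepts the same axis values and returns the same shape; the array is made up.

```python
import numpy as np

cube = np.arange(24, dtype=float).reshape(2, 3, 4)

# Reduce over the first two axes together, leaving one variance per column.
var_pairs = np.var(cube, axis=(0, 1))

# Equivalent view: merge the first two axes, then reduce over the merged axis.
var_merged = np.var(cube.reshape(6, 4), axis=0)

print(var_pairs.shape)                      # (4,)
print(np.allclose(var_pairs, var_merged))   # True
```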

diffprivlib.tools.nanvar(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)
Compute the differentially private variance along the specified axis, ignoring NaNs.
Returns the variance of the array elements, a measure of the spread of a distribution, with differential privacy. The variance is computed for the flattened array by default, otherwise over the specified axis. Noise is added using LaplaceBoundedDomain to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.var for further details, as the behaviour of var closely follows its NumPy variant.
For all-NaN slices, NaN is returned and a RuntimeWarning is raised.
Parameters
array (array_like) – Array containing numbers whose variance is desired. If array is not an array, a conversion is attempted.
epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).
bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).
axis (int or tuple of ints, optional) –
Axis or axes along which the variance is computed. The default is to compute the variance of the flattened array.
If this is a tuple of ints, a variance is performed over multiple axes, instead of a single axis or all the axes as before.
dtype (data-type, optional) – Type to use in computing the variance. For arrays of integer type the default is float64; for arrays of float types it is the same as the array type.
keepdims (bool, optional) –
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the var method of subclasses of ndarray, however any non-default value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.
 Returns
variance – Returns a new array containing the variance.
Return type
ndarray, see dtype parameter above