diffprivlib.tools

Tools for data analysis with differential privacy.

Histogram functions

diffprivlib.tools.histogram(sample, epsilon=1.0, bins=10, range=None, weights=None, density=None, accountant=None, **unused_args)

Compute the differentially private histogram of a set of data.

The histogram is computed using numpy.histogram, and noise is added using GeometricTruncated to satisfy differential privacy. If the range parameter is not specified, the range is taken from the data and a PrivacyLeakWarning is raised. Users are referred to numpy.histogram for more usage notes.
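The count-then-noise idea described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not diffprivlib's implementation: the helper names `two_sided_geometric` and `noisy_counts` are hypothetical, a sensitivity of 1 is assumed, and clamping at zero stands in for the truncation performed by GeometricTruncated.

```python
import math
import random


def two_sided_geometric(epsilon, rng):
    """Sample integer noise with P(k) proportional to exp(-epsilon * |k|)."""
    def geometric():
        # Number of failures before the first success,
        # with success probability 1 - exp(-epsilon).
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        return int(math.floor(-math.log(u) / epsilon))

    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric() - geometric()


def noisy_counts(counts, epsilon, rng=None):
    """Add integer noise to each bin count, then clamp at zero (hypothetical helper)."""
    rng = rng or random.Random()
    return [max(0, c + two_sided_geometric(epsilon, rng)) for c in counts]


counts = [3, 0, 5]  # exact bin counts, e.g. from a non-private histogram
print(noisy_counts(counts, epsilon=1.0, rng=random.Random(0)))
```

Smaller values of epsilon give wider noise and therefore stronger privacy at the cost of accuracy.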

Parameters
  • sample (array_like) – Input data. The histogram is computed over the flattened array.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) to be applied.

  • bins (int or sequence of scalars or str, default: 10) –

    If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines a monotonically increasing array of bin edges, including the rightmost edge, allowing for non-uniform bin widths.

    If bins is a string, it defines the method used to calculate the optimal bin width, as defined by histogram_bin_edges.

  • range ((float, float), optional) – The lower and upper range of the bins. If not provided, range is simply (sample.min(), sample.max()). Values outside the range are ignored. The first element of the range must be less than or equal to the second. range affects the automatic bin computation as well. While bin width is computed to be optimal based on the actual data within range, the bin count will fill the entire range including portions containing no data.

  • weights (array_like, optional) – An array of weights, of the same shape as sample. Each value in sample only contributes its associated weight towards the bin count (instead of 1). If density is True, the weights are normalized, so that the integral of the density over the range remains 1.

  • density (bool, optional) – If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

  • hist (array) – The values of the histogram. See density and weights for a description of the possible semantics.

  • bin_edges (array of dtype float) – Return the bin edges (length(hist)+1).

Notes

All bins but the last (righthand-most) are half-open. In other words, if bins is:

[1, 2, 3, 4]

then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
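The edge convention above can be checked with a small exact (non-private) counter. This is only a sketch of the binning rule; `bin_counts` is a hypothetical helper and no noise is added.

```python
from bisect import bisect_right


def bin_counts(values, edges):
    """Count values per bin; every bin is [lo, hi) except the last, which is [lo, hi]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # values outside the range are ignored
        if v == edges[-1]:
            counts[-1] += 1  # the rightmost edge belongs to the last bin
        else:
            counts[bisect_right(edges, v) - 1] += 1
    return counts


print(bin_counts([1, 2, 2.5, 3, 4, 4.5], [1, 2, 3, 4]))  # [1, 2, 2]
```

The value 2 lands in the second bin, 4 lands in the last bin, and 4.5 is dropped.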

diffprivlib.tools.histogramdd(sample, epsilon=1.0, bins=10, range=None, weights=None, density=None, accountant=None, **unused_args)

Compute the differentially private multidimensional histogram of some data.

The histogram is computed using numpy.histogramdd, and noise is added using GeometricTruncated to satisfy differential privacy. If the range parameter is not specified, the range is taken from the data and a PrivacyLeakWarning is raised. Users are referred to numpy.histogramdd for more usage notes.

Parameters
  • sample ((N, D) array, or (D, N) array_like) –

    The data to be histogrammed.

    Note the unusual interpretation of sample when an array_like:

    • When an array, each row is a coordinate in a D-dimensional space - such as histogramdd(np.array([p1, p2, p3])).

    • When an array_like, each element is the list of values for a single coordinate - such as histogramdd((X, Y, Z)).

    The first form should be preferred.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) to be applied.

  • bins (sequence or int, default: 10) –

    The bin specification:

    • A sequence of arrays describing the monotonically increasing bin edges along each dimension.

    • The number of bins for each dimension (nx, ny, … =bins)

    • The number of bins for all dimensions (nx=ny=…=bins).

  • range (sequence, optional) – A sequence of length D, each an optional (lower, upper) tuple giving the outer bin edges to be used if the edges are not given explicitly in bins. An entry of None in the sequence results in the minimum and maximum values being used for the corresponding dimension. The default, None, is equivalent to passing a tuple of D None values.

  • density (bool, optional) – If False, the default, returns the number of samples in each bin. If True, returns the probability density function at the bin, bin_count / sample_count / bin_volume.

  • weights ((N,) array_like, optional) – An array of values w_i weighing each sample (x_i, y_i, z_i, …). Weights are normalized to 1 if density is True. If density is False, the values of the returned histogram are equal to the sum of the weights belonging to the samples falling into each bin.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

  • H (ndarray) – The multidimensional histogram of sample. See density and weights for the different possible semantics.

  • edges (list) – A list of D arrays describing the bin edges for each dimension.

See also

histogram()

1-D differentially private histogram

histogram2d()

2-D differentially private histogram
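The two accepted layouts of the sample parameter of histogramdd describe the same points. The plain-Python sketch below (no library calls; the variable names are illustrative) shows how the per-coordinate form maps onto the preferred one-row-per-point form.

```python
# Three points in 2-D space, given as one row per point (the preferred form).
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

# The same points given as one sequence per coordinate (X, Y).
X = [1.0, 2.0, 3.0]
Y = [10.0, 20.0, 30.0]

# Transposing the per-coordinate form recovers the per-point rows.
rows_from_columns = list(zip(X, Y))
print(rows_from_columns == rows)  # True
```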

diffprivlib.tools.histogram2d(array_x, array_y, epsilon=1.0, bins=10, range=None, weights=None, density=None, accountant=None, **unused_args)

Compute the differentially private bi-dimensional histogram of two data samples.

Parameters
  • array_x (array_like, shape (N,)) – An array containing the x coordinates of the points to be histogrammed.

  • array_y (array_like, shape (N,)) – An array containing the y coordinates of the points to be histogrammed.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\) to be applied.

  • bins (int or array_like or [int, int] or [array, array], default: 10) –

    The bin specification:

    • If int, the number of bins for the two dimensions (nx=ny=bins).

    • If array_like, the bin edges for the two dimensions (x_edges=y_edges=bins).

    • If [int, int], the number of bins in each dimension (nx, ny = bins).

    • If [array, array], the bin edges in each dimension (x_edges, y_edges = bins).

    • A combination [int, array] or [array, int], where int is the number of bins and array is the bin edges.

  • range (array_like, shape(2,2), optional) – The leftmost and rightmost edges of the bins along each dimension (if not specified explicitly in the bins parameters): [[xmin, xmax], [ymin, ymax]]. All values outside of this range will be considered outliers and not tallied in the histogram.

  • density (bool, optional) – If False, the default, returns the number of samples in each bin. If True, returns the probability density function at the bin, bin_count / sample_count / bin_area.

  • weights (array_like, shape(N,), optional) – An array of values w_i weighing each sample (x_i, y_i). Weights are normalized to 1 if density is True. If density is False, the values of the returned histogram are equal to the sum of the weights belonging to the samples falling into each bin.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

  • H (ndarray, shape(nx, ny)) – The bi-dimensional histogram of samples x and y. Values in x are histogrammed along the first dimension and values in y are histogrammed along the second dimension.

  • xedges (ndarray, shape(nx+1,)) – The bin edges along the first dimension.

  • yedges (ndarray, shape(ny+1,)) – The bin edges along the second dimension.

See also

histogram()

1-D differentially private histogram

histogramdd()

Multidimensional differentially private histogram

Notes

When density is True, then the returned histogram is the sample density, defined such that the sum over bins of the product bin_value * bin_area is 1.

Please note that the histogram does not follow the Cartesian convention where x values are on the abscissa and y values on the ordinate axis. Rather, x is histogrammed along the first dimension of the array (vertical), and y along the second dimension of the array (horizontal). This ensures compatibility with histogramdd.
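The axis convention can be seen in a small exact (non-private) 2-D counter. This is a sketch only: `count_2d` is a hypothetical helper, no noise is added, and it mirrors just the indexing rule described above.

```python
def count_2d(xs, ys, x_edges, y_edges):
    """Exact 2-D counts with H[i][j]: i indexes x bins, j indexes y bins."""
    def bin_index(v, edges):
        # Half-open bins, except that the last bin includes its right edge.
        if v < edges[0] or v > edges[-1]:
            return None
        for i in range(len(edges) - 1):
            if v < edges[i + 1]:
                return i
        return len(edges) - 2

    H = [[0] * (len(y_edges) - 1) for _ in range(len(x_edges) - 1)]
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            H[i][j] += 1
    return H


# Two x bins and three y bins: the result has 2 rows (x) and 3 columns (y).
H = count_2d([0.5, 0.5, 1.5], [0.5, 2.5, 1.5], [0, 1, 2], [0, 1, 2, 3])
print(H)  # [[1, 0, 1], [0, 1, 0]]
```

A plot that puts x on the horizontal axis would therefore need the transpose of H.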

General Utilities

diffprivlib.tools.count_nonzero(array, epsilon=1.0, accountant=None, axis=None, keepdims=False)

Counts the number of non-zero values in the array array with differential privacy.

The word “non-zero” is in reference to the Python 2.x built-in method __nonzero__() (renamed __bool__() in Python 3.x) of Python objects that tests an object’s “truthfulness”. For example, any number is considered truthful if it is nonzero, whereas any string is considered truthful if it is not the empty string. Thus, this function (recursively) counts how many elements in array (and in sub-arrays thereof) have their __nonzero__() or __bool__() method evaluated to True.
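The truth test described above is the one applied by Python's built-in `bool`. The sketch below computes the exact (non-private) count for a flat list of mixed objects; the differentially private function would return a noisy version of this number.

```python
values = [0, 1, -2, 0.0, "", "text", None, [], [0]]

# bool(v) applies the same truth test that __bool__ (or __len__) provides.
exact_count = sum(1 for v in values if bool(v))
print(exact_count)  # 4
```

Only 1, -2, "text" and [0] are truthy here; zero, the empty string, None and the empty list are not.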

Parameters
  • array (array_like) – The array for which to count non-zeros.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

  • axis (int or tuple, optional) – Axis or tuple of axes along which to count non-zeros. Default is None, meaning that non-zeros will be counted along a flattened version of array.

  • keepdims (bool, optional) – If this is set to True, the axes that are counted are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

Returns

count – Differentially private number of non-zero values in the array along the given axis. If axis is None, the total number of non-zero values in the array is returned.

Return type

int or array of int

diffprivlib.tools.mean(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)

Compute the differentially private arithmetic mean along the specified axis.

Returns the average of the array elements with differential privacy. The average is taken over the flattened array by default, otherwise over the specified axis. Noise is added using Laplace to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.mean for further details, as the behaviour of mean closely follows its NumPy variant.
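How bounds can determine the sensitivity is sketched below with the standard library. This is not diffprivlib's implementation: `laplace_noise` and `private_mean` are hypothetical helpers, and both the clipping step and the sensitivity formula (upper - lower) / n are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, epsilon, bounds, rng=None):
    """Clip to bounds, average, and add Laplace noise (hypothetical helper)."""
    rng = rng or random.Random()
    lower, upper = bounds
    clipped = [min(max(v, lower), upper) for v in values]
    # Assumed sensitivity: replacing one value moves the mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 47, 51, 62, 120]  # 120 is clipped to the upper bound of 100
print(private_mean(ages, epsilon=1.0, bounds=(0, 100), rng=random.Random(0)))
```

Tighter bounds reduce the noise scale, but values outside them are clipped and bias the result, so bounds are best chosen from domain knowledge rather than from the data.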

Parameters
  • array (array_like) – Array containing numbers whose mean is desired. If array is not an array, a conversion is attempted.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).

  • axis (int or tuple of ints, optional) –

    Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.

    If this is a tuple of ints, a mean is performed over multiple axes, instead of a single axis or all the axes as before.

  • dtype (data-type, optional) – Type to use in computing the mean. For integer inputs, the default is float64; for floating point inputs, it is the same as the input dtype.

  • keepdims (bool, optional) –

    If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

    If the default value is passed, then keepdims will not be passed through to the mean method of sub-classes of ndarray, however any non-default value will be. If the sub-class’ method does not implement keepdims any exceptions will be raised.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

m – Returns a new array containing the mean values.

Return type

ndarray, see dtype parameter above

See also

std(), var(), nanmean()

diffprivlib.tools.nanmean(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)

Compute the differentially private arithmetic mean along the specified axis, ignoring NaNs.

Returns the average of the array elements with differential privacy. The average is taken over the flattened array by default, otherwise over the specified axis. Noise is added using Laplace to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.mean for further details, as the behaviour of mean closely follows its NumPy variant.

For all-NaN slices, NaN is returned and a RuntimeWarning is raised.
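The NaN handling can be illustrated in isolation. The sketch below omits the noise entirely and shows only the filtering rule for a flat list; `nan_filtered_mean` is a hypothetical helper, and the warning message text is an assumption.

```python
import math
import warnings


def nan_filtered_mean(values):
    """Exact mean over the non-NaN entries; NaN plus a RuntimeWarning if none remain."""
    kept = [v for v in values if not math.isnan(v)]
    if not kept:
        warnings.warn("Mean of empty slice", RuntimeWarning)
        return math.nan
    return sum(kept) / len(kept)


print(nan_filtered_mean([1.0, math.nan, 3.0]))  # 2.0
```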

Parameters
  • array (array_like) – Array containing numbers whose mean is desired. If array is not an array, a conversion is attempted.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).

  • axis (int or tuple of ints, optional) –

    Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.

    If this is a tuple of ints, a mean is performed over multiple axes, instead of a single axis or all the axes as before.

  • dtype (data-type, optional) – Type to use in computing the mean. For integer inputs, the default is float64; for floating point inputs, it is the same as the input dtype.

  • keepdims (bool, optional) –

    If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

    If the default value is passed, then keepdims will not be passed through to the mean method of sub-classes of ndarray, however any non-default value will be. If the sub-class’ method does not implement keepdims any exceptions will be raised.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

m – Returns a new array containing the mean values.

Return type

ndarray, see dtype parameter above

See also

std(), var(), mean()

diffprivlib.tools.std(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)

Compute the differentially private standard deviation along the specified axis.

Returns the standard deviation of the array elements, a measure of the spread of a distribution, with differential privacy. The standard deviation is computed for the flattened array by default, otherwise over the specified axis. Noise is added using LaplaceBoundedDomain to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.std for further details, as the behaviour of std closely follows its NumPy variant.

Parameters
  • array (array_like) – Calculate the standard deviation of these values.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).

  • axis (int or tuple of ints, optional) –

    Axis or axes along which the standard deviation is computed. The default is to compute the standard deviation of the flattened array.

    If this is a tuple of ints, a standard deviation is performed over multiple axes, instead of a single axis or all the axes as before.

  • dtype (dtype, optional) – Type to use in computing the standard deviation. For arrays of integer type the default is float64, for arrays of float types it is the same as the array type.

  • keepdims (bool, optional) –

    If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

    If the default value is passed, then keepdims will not be passed through to the std method of sub-classes of ndarray, however any non-default value will be. If the sub-class’ method does not implement keepdims any exceptions will be raised.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

standard_deviation – Return a new array containing the standard deviation.

Return type

ndarray, see dtype parameter above.

See also

var(), mean(), nanstd()

diffprivlib.tools.nanstd(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)

Compute the differentially private standard deviation along the specified axis, ignoring NaNs.

Returns the standard deviation of the array elements, a measure of the spread of a distribution, with differential privacy. The standard deviation is computed for the flattened array by default, otherwise over the specified axis. Noise is added using LaplaceBoundedDomain to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.std for further details, as the behaviour of std closely follows its NumPy variant.

For all-NaN slices, NaN is returned and a RuntimeWarning is raised.

Parameters
  • array (array_like) – Calculate the standard deviation of these values.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).

  • axis (int or tuple of ints, optional) –

    Axis or axes along which the standard deviation is computed. The default is to compute the standard deviation of the flattened array.

    If this is a tuple of ints, a standard deviation is performed over multiple axes, instead of a single axis or all the axes as before.

  • dtype (dtype, optional) – Type to use in computing the standard deviation. For arrays of integer type the default is float64, for arrays of float types it is the same as the array type.

  • keepdims (bool, optional) –

    If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

    If the default value is passed, then keepdims will not be passed through to the std method of sub-classes of ndarray, however any non-default value will be. If the sub-class’ method does not implement keepdims any exceptions will be raised.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

standard_deviation – Return a new array containing the standard deviation.

Return type

ndarray, see dtype parameter above.

See also

var(), mean(), std()

diffprivlib.tools.sum(array, epsilon=1.0, bounds=None, accountant=None, axis=None, dtype=None, keepdims=<no value>, **unused_args)

Sum of array elements over a given axis with differential privacy.

Parameters
  • array (array_like) – Elements to sum.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

  • axis (None or int or tuple of ints, optional) –

    Axis or axes along which a sum is performed. The default, axis=None, will sum all of the elements of the input array. If axis is negative it counts from the last to the first axis.

    If axis is a tuple of ints, a sum is performed on all of the axes specified in the tuple instead of a single axis or all the axes as before.

  • dtype (dtype, optional) – The type of the returned array and of the accumulator in which the elements are summed. The dtype of array is used by default unless array has an integer dtype of less precision than the default platform integer. In that case, if array is signed then the platform integer is used while if array is unsigned then an unsigned integer of the same precision as the platform integer is used.

  • keepdims (bool, optional) –

    If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

    If the default value is passed, then keepdims will not be passed through to the sum method of sub-classes of ndarray, however any non-default value will be. If the sub-class’ method does not implement keepdims any exceptions will be raised.

Returns

sum_along_axis – An array with the same shape as array, with the specified axis removed. If array is a 0-d array, or if axis is None, a scalar is returned.

Return type

ndarray

See also

ndarray.sum()

Equivalent non-private method.

mean(), nansum()

diffprivlib.tools.nansum(array, epsilon=1.0, bounds=None, accountant=None, axis=None, dtype=None, keepdims=<no value>, **unused_args)

Sum of array elements over a given axis with differential privacy, ignoring NaNs.

Parameters
  • array (array_like) – Elements to sum.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

  • axis (None or int or tuple of ints, optional) –

    Axis or axes along which a sum is performed. The default, axis=None, will sum all of the elements of the input array. If axis is negative it counts from the last to the first axis.

    If axis is a tuple of ints, a sum is performed on all of the axes specified in the tuple instead of a single axis or all the axes as before.

  • dtype (dtype, optional) – The type of the returned array and of the accumulator in which the elements are summed. The dtype of array is used by default unless array has an integer dtype of less precision than the default platform integer. In that case, if array is signed then the platform integer is used while if array is unsigned then an unsigned integer of the same precision as the platform integer is used.

  • keepdims (bool, optional) –

    If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

    If the default value is passed, then keepdims will not be passed through to the sum method of sub-classes of ndarray, however any non-default value will be. If the sub-class’ method does not implement keepdims any exceptions will be raised.

Returns

sum_along_axis – An array with the same shape as array, with the specified axis removed. If array is a 0-d array, or if axis is None, a scalar is returned.

Return type

ndarray

See also

ndarray.sum()

Equivalent non-private method.

mean(), sum()

diffprivlib.tools.var(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)

Compute the differentially private variance along the specified axis.

Returns the variance of the array elements, a measure of the spread of a distribution, with differential privacy. The variance is computed for the flattened array by default, otherwise over the specified axis. Noise is added using LaplaceBoundedDomain to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.var for further details, as the behaviour of var closely follows its NumPy variant.
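Why a bounded output domain matters for the variance can be sketched as follows. This is a simplified stand-in, not the LaplaceBoundedDomain mechanism: it adds ordinary Laplace noise and then clamps the result, which is a post-processing step. `private_variance` is a hypothetical helper, and the sensitivity formula is an assumed bound used for illustration.

```python
import random


def private_variance(values, epsilon, bounds, rng=None):
    """Clip, take the population variance, add Laplace noise, clamp to the valid domain."""
    rng = rng or random.Random()
    lower, upper = bounds
    n = len(values)  # the sketch assumes n >= 2 and upper > lower
    clipped = [min(max(v, lower), upper) for v in values]
    mean = sum(clipped) / n
    variance = sum((v - mean) ** 2 for v in clipped) / n

    # Assumed sensitivity bound for the variance of n values in [lower, upper].
    sensitivity = (upper - lower) ** 2 * (n - 1) / n ** 2
    scale = sensitivity / epsilon
    # Laplace(0, scale) sample as the difference of two exponential samples.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

    # The variance of values in [lower, upper] lies in [0, (upper - lower)**2 / 4].
    max_variance = (upper - lower) ** 2 / 4
    return min(max(variance + noise, 0.0), max_variance)


print(private_variance([1, 2, 3, 4], epsilon=1.0, bounds=(0, 5), rng=random.Random(0)))
```

Without the clamp, unconstrained noise could produce a negative variance, which would then break a later square root when computing a standard deviation.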

Parameters
  • array (array_like) – Array containing numbers whose variance is desired. If array is not an array, a conversion is attempted.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).

  • axis (int or tuple of ints, optional) –

    Axis or axes along which the variance is computed. The default is to compute the variance of the flattened array.

    If this is a tuple of ints, a variance is performed over multiple axes, instead of a single axis or all the axes as before.

  • dtype (data-type, optional) – Type to use in computing the variance. For arrays of integer type the default is float64; for arrays of float types it is the same as the array type.

  • keepdims (bool, optional) –

    If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

    If the default value is passed, then keepdims will not be passed through to the var method of sub-classes of ndarray, however any non-default value will be. If the sub-class’ method does not implement keepdims any exceptions will be raised.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

variance – Returns a new array containing the variance.

Return type

ndarray, see dtype parameter above

See also

std(), mean(), nanvar()

diffprivlib.tools.nanvar(array, epsilon=1.0, bounds=None, axis=None, dtype=None, keepdims=<no value>, accountant=None, **unused_args)

Compute the differentially private variance along the specified axis, ignoring NaNs.

Returns the variance of the array elements, a measure of the spread of a distribution, with differential privacy. The variance is computed for the flattened array by default, otherwise over the specified axis. Noise is added using LaplaceBoundedDomain to satisfy differential privacy, where sensitivity is calculated using bounds. Users are advised to consult the documentation of numpy.var for further details, as the behaviour of var closely follows its NumPy variant.

For all-NaN slices, NaN is returned and a RuntimeWarning is raised.

Parameters
  • array (array_like) – Array containing numbers whose variance is desired. If array is not an array, a conversion is attempted.

  • epsilon (float, default: 1.0) – Privacy parameter \(\epsilon\).

  • bounds (tuple, optional) – Bounds of the values of the array, of the form (min, max).

  • axis (int or tuple of ints, optional) –

    Axis or axes along which the variance is computed. The default is to compute the variance of the flattened array.

    If this is a tuple of ints, a variance is performed over multiple axes, instead of a single axis or all the axes as before.

  • dtype (data-type, optional) – Type to use in computing the variance. For arrays of integer type the default is float64; for arrays of float types it is the same as the array type.

  • keepdims (bool, optional) –

    If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

    If the default value is passed, then keepdims will not be passed through to the var method of sub-classes of ndarray, however any non-default value will be. If the sub-class’ method does not implement keepdims any exceptions will be raised.

  • accountant (BudgetAccountant, optional) – Accountant to keep track of privacy budget.

Returns

variance – Returns a new array containing the variance.

Return type

ndarray, see dtype parameter above

See also

std(), mean(), var()