We present a differentially private mechanism to display statistics (e.g., the moving average) of a stream of real-valued observations where the bound on each observation is either too conservative or unknown in advance. This is particularly relevant to real-time data monitoring and reporting, e.g., energy data collected through smart meters. Our focus is on real-world data streams whose distribution is light-tailed, meaning that the tail approaches zero at least as fast as that of the exponential distribution. For such data streams, individual observations are expected to be concentrated below an unknown threshold. Estimating this threshold from the data can violate privacy, as it may reveal particular events tied to individuals [1]. On the other hand, an overly conservative threshold may hurt accuracy by adding more noise than necessary. We construct a utility-optimizing differentially private mechanism to release this threshold based on the input stream. Our main advantage over state-of-the-art algorithms is that the noise added to each observation of the stream is scaled to the threshold instead of a possibly much larger bound, resulting in a considerable gain in utility when the difference between the two is significant. Using two real-world datasets, we demonstrate that our mechanism, on average, improves utility by a factor of 3.5 on the first dataset and by a factor of 9 on the second. While our main focus is on continual release of statistics, our mechanism for releasing the threshold can be used in various other applications where a (privacy-preserving) measure of the scale of the input distribution is required.
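To make the scaling argument concrete, the following minimal sketch compares Laplace noise calibrated to a conservative bound with noise calibrated to a smaller threshold on a synthetic light-tailed stream. It is an illustration under stated assumptions, not the mechanism of the paper: the values of `bound`, `threshold`, and `epsilon` are hypothetical, the threshold is simply fixed by hand rather than released privately, and noise is added independently to each observation rather than through a continual release scheme.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. Exp(1) variables follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_stream(stream, clip, epsilon, rng):
    # Clip each observation to [0, clip], then add Laplace noise of scale clip / epsilon.
    return [min(max(x, 0.0), clip) + laplace_noise(clip / epsilon, rng)
            for x in stream]

def mean_abs_error(stream, released):
    # Average absolute deviation of the released values from the true ones.
    return sum(abs(a - b) for a, b in zip(stream, released)) / len(stream)

rng = random.Random(0)
stream = [rng.expovariate(1.0) for _ in range(20000)]  # synthetic light-tailed data

epsilon = 1.0     # hypothetical privacy parameter
bound = 50.0      # hypothetical conservative a priori bound
threshold = 5.0   # hypothetical threshold, fixed by hand for this illustration

err_bound = mean_abs_error(stream, noisy_stream(stream, bound, epsilon, rng))
err_threshold = mean_abs_error(stream, noisy_stream(stream, threshold, epsilon, rng))
print(f"error with bound:     {err_bound:.2f}")
print(f"error with threshold: {err_threshold:.2f}")
```

Because the Laplace scale is proportional to the clipping value, the error in this toy setting shrinks roughly in proportion to the ratio of bound to threshold, at the cost of a small bias from clipping the few observations that exceed the threshold.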