Generating Synthetic Tabular Data That’s Differentially Private
Abstract
While generative models can produce synthetic datasets that preserve the statistical qualities of the training data without identifying any particular record in it, most generative models to date do not offer mathematical guarantees of privacy that can be used to facilitate information sharing or publishing. Without those guarantees, substantial effort is needed to ensure that adversarial attacks on the models, and on the synthetic data they generate, are thwarted.
We can never be sure what attacks might become feasible in the future, and defending against privacy attacks reactively, once they have already occurred, is ineffective. This is exactly the problem that differential privacy (DP) solves by bounding the probability that a bad event occurs: by introducing calibrated noise into an algorithm, DP defends, with high probability, against future privacy attacks, including ones not yet invented.
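To make "calibrated noise" concrete: formally, a randomized mechanism M is (ε, δ)-differentially private if Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ for every pair of datasets D, D′ differing in one record and every output set S. The sketch below is not from the session; the function name dp_count and the example data are illustrative. It shows the classic Laplace mechanism, which achieves pure ε-DP for a counting query by scaling noise to the query's sensitivity.

```python
import numpy as np

def dp_count(values, predicate, epsilon, rng=None):
    """Release a count under epsilon-DP via the Laplace mechanism.

    Adding or removing one record changes a count by at most 1
    (L1 sensitivity = 1), so Laplace noise with scale 1/epsilon
    yields an epsilon-differentially-private answer.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Example: privately count records with age > 50 at epsilon = 0.5.
ages = [23, 45, 67, 34, 71, 52]
print(dp_count(ages, lambda age: age > 50, epsilon=0.5))
```

Smaller ε means more noise and a stronger guarantee, and the bound holds no matter what attack is mounted later, which is what makes the defense proactive rather than reactive.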
In this session, we'll explore approaches to applying differential privacy, including one that relies on measuring low-dimensional distributions (marginals) in a dataset and learning a graphical model representation from those measurements. We'll end with a preview of Gretel's new generative model, which applies this method to create high-quality synthetic tabular data that is differentially private.
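The session description doesn't spell out the mechanics, but a minimal sketch of the measure-marginals-then-model idea, in the spirit of graphical-model approaches such as Private-PGM, might look like the following. Everything here is an illustrative assumption rather than Gretel's implementation: the chain-structured model, the even split of the ε budget across measurements, and the helpers noisy_marginal and synthesize are hypothetical, and a real system would measure over the full column domains rather than only observed value combinations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def noisy_marginal(df, cols, epsilon):
    """Measure one low-dimensional marginal under epsilon-DP.

    One record changes one cell of the contingency table by 1, so the
    L1 sensitivity is 1 and Laplace(1/epsilon) noise suffices. Caveat:
    a production system would enumerate the full domain of `cols`;
    grouping only observed combinations, as here, is a simplification.
    """
    counts = df.groupby(cols).size().astype(float)
    counts += rng.laplace(scale=1.0 / epsilon, size=len(counts))
    counts = counts.clip(lower=0) + 1e-9  # keep a valid distribution
    return counts / counts.sum()

def synthesize(df, columns, epsilon, n_rows):
    """Sample synthetic rows from a chain-structured graphical model.

    Each adjacent pair of columns gets one privately measured pairwise
    marginal; sequential composition splits the epsilon budget evenly
    across the len(columns) - 1 measurements.
    """
    eps_each = epsilon / (len(columns) - 1)
    pair_dists = [noisy_marginal(df, [a, b], eps_each)
                  for a, b in zip(columns, columns[1:])]
    first = pair_dists[0].groupby(level=0).sum()  # first column's marginal
    rows = []
    for _ in range(n_rows):
        record = [rng.choice(first.index, p=first.values)]
        for dist in pair_dists:
            cond = dist.loc[record[-1]]  # P(next | previous = sampled value)
            cond = cond / cond.sum()
            record.append(rng.choice(cond.index, p=cond.values))
        rows.append(record)
    return pd.DataFrame(rows, columns=columns)

# Example: synthesize 5 rows from a toy table with a total epsilon of 2.0.
toy = pd.DataFrame({"city": ["A", "A", "B", "B", "B"],
                    "plan": ["x", "y", "x", "x", "y"],
                    "churn": [0, 1, 0, 0, 1]})
print(synthesize(toy, ["city", "plan", "churn"], epsilon=2.0, n_rows=5))
```

The key design point is that noise is added only when the marginals are measured; model fitting and sampling are post-processing, which by the post-processing property of DP consumes no additional privacy budget.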