Intelligent Data Masking: Using GANs to Generate Synthetic Data for Privacy-Preserving Analytics
Muniraju Hullurappa
Lead Data Engineer, Department of Data Analytics and Information Technology, System Soft Technologies, Dallas, Texas, USA
Abstract
Protecting sensitive information while enabling data-driven insights is a significant challenge in the age of big data. Advances in data analytics and artificial intelligence have sharpened a dilemma for organizations: balancing data utility against stringent privacy requirements. Traditional data anonymization techniques often discard substantial information, which hinders the ability to draw meaningful insights. Generative Adversarial Networks (GANs) offer an alternative: they generate synthetic data that retains the statistical properties of the original dataset while protecting sensitive information. This paper examines GANs as intelligent data masks that produce high-quality synthetic data for privacy-preserving analytics. The proposed framework covers data preprocessing, GAN architecture, and evaluation metrics for both privacy and utility. The framework is evaluated experimentally on benchmark datasets, with traditional anonymization methods as baselines. The results indicate that GANs achieve a better balance between data utility and privacy than these baselines, significantly reducing re-identification risk while maintaining high utility for machine learning tasks. The work also presents practical applications in healthcare, finance, and marketing, positioning GANs as a promising solution for privacy-preserving analytics across diverse domains.
Keywords: Generative Adversarial Networks (GANs); Synthetic Data Generation; Privacy-Preserving Analytics; Data Anonymization; Data Utility; Re-identification Risk Reduction.