A data governance framework that improves anonymization of datasets while maintaining original information
The success of deep learning owes partly to the availability of large-scale datasets. These datasets are often crowdsourced from individual users and contain private information such as gender and age. Users' growing privacy concerns about data sharing hinder the creation and use of crowdsourced datasets and leave new deep learning applications short of training data. One existing solution is to preprocess the raw data on the user side to extract features, so that only the extracted features are sent to the data collector. Unfortunately, attackers can still use these extracted features to train an adversarial classifier that infers private attributes. Other solutions leverage game theory to protect private attributes. However, these defenses are designed for known primary learning tasks, and the extracted features work poorly for unknown learning tasks. Hence, there is an ongoing need for improved anonymization of datasets containing personally identifiable information.
Duke inventors have reported a data governance framework intended to remove private information while preserving as much of the original information as possible. This can be applied to better train deep learning architectures. Specifically, this is a task-independent privacy-respecting data crowdsourcing framework (TIPRDC) with an anonymized intermediate representation. With TIPRDC, a user locally extracts features from the raw data using the learned feature extractor, and the data collector acquires only the extracted features, which it uses to train a DNN model for the primary learning tasks. The technology has been extensively evaluated and compared with existing methods using two image datasets and one text dataset. The results show that TIPRDC substantially outperforms existing methods.
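The data flow described above (local feature extraction, with only the features leaving the user's device) can be sketched as follows. This is a minimal illustration, not the TIPRDC implementation: the names `make_extractor`, `extract_features`, `DataCollector`, and `user_upload` are hypothetical, and a fixed random linear map with a ReLU stands in for the learned feature extractor. A random projection offers no privacy protection by itself; the sketch shows only who holds which data.

```python
import random
from typing import List

Vector = List[float]


def make_extractor(in_dim: int, out_dim: int, seed: int = 0) -> List[Vector]:
    """Hypothetical stand-in for the learned feature extractor.

    Returns a fixed random weight matrix (out_dim rows, in_dim columns).
    In the described framework this would be a trained network instead.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def extract_features(weights: List[Vector], raw: Vector) -> Vector:
    """Runs on the user's device: linear projection followed by a ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, raw))) for row in weights]


class DataCollector:
    """Collector side: stores extracted features and never sees raw data."""

    def __init__(self) -> None:
        self.features: List[Vector] = []

    def receive(self, feats: Vector) -> None:
        self.features.append(list(feats))


def user_upload(weights: List[Vector], raw: Vector, collector: DataCollector) -> Vector:
    """Extract features locally, then send only those features."""
    feats = extract_features(weights, raw)
    collector.receive(feats)  # the raw vector is not transmitted
    return feats


if __name__ == "__main__":
    weights = make_extractor(in_dim=8, out_dim=3)
    collector = DataCollector()
    raw_sample = [float(i) for i in range(8)]  # placeholder for a user's raw data
    user_upload(weights, raw_sample, collector)
    print(len(collector.features), "feature vector(s) held by the collector")
```

In this sketch the collector would train its model for the primary learning tasks on `collector.features` alone; the raw sample never leaves `user_upload`.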
- A general framework that does not depend on any specific deep learning model, algorithm, or platform
- Evaluation on three benchmark datasets shows that TIPRDC attains a better privacy-utility tradeoff than existing solutions
- Practicality has been demonstrated with cross-dataset evaluations on CelebA and LFW, which show that TIPRDC transfers across datasets