With the rapid growth of synthetic data generation (SDG) as a modern privacy-enhancing technology (PET), we are often asked how it can be used in practice. A new paper published this week in the latest edition of the international journal Discover Artificial Intelligence explores 7 concrete and interconnected use cases where the pharmaceutical industry and others are beginning to see the benefits of SDG. We’ve summarized them as follows:
- Machine learning: Machine learning models can become fairer and more inclusive when their training datasets are amplified with synthetic data, for example when a particular group is under-represented. Models trained on synthetic rather than real data are also less vulnerable to privacy attacks.
- Internal software testing: Synthetic data can be used for data engineering and software testing. In fact, after investigating a major data breach, the Norwegian data protection authority stated that the organization could have avoided the breach, and a hefty GDPR fine, had it used synthetic data.
- Education, training and hackathons: When employees need to learn how to handle patients’ personal data, they can first be trained using synthetic data. Synthetic data can also be used in hackathons that bring various people together to address real-world problems.
- Data retention: Because of the rules and costs associated with retaining data, organizations often need to dispose of data once they are no longer required for the original purposes. Applying data synthesis techniques before disposal can help organizations maintain data utility.
- Vendor assessment and sharing data with third-party services: Organizations wanting to leverage innovative technologies from third parties have to go through long audits and extensive contracting because the proposed technologies are assessed on the organizations’ own data sets. Instead, these new technologies can be assessed more safely and efficiently using synthetic data.
- Internal secondary use: It’s a best practice to use certain PETs to transform data so they can be used for secondary research. Scientists can use synthetic data at the exploration stages to determine what data they need. And synthetic data generated from clinical trial data can be used as a proxy for real data in secondary analysis, given the similarity shown between the two.
- External sharing: SDG can help increase transparency and information-sharing within the pharmaceutical industry, and accelerate access to data, alleviating some of the current challenges in sharing real data between and across jurisdictions.
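To make the machine learning use case above a little more concrete, here is a minimal Python sketch of amplifying an under-represented group. It is an illustration under stated assumptions, not the method described in the paper: the function name `amplify_group`, the column names and the toy records are all hypothetical, and the "synthesizer" simply samples each column independently from the group's observed values. Real SDG tools fit a joint model of the data and evaluate privacy risk, which this sketch does not do.

```python
import random
from collections import Counter


def amplify_group(records, group_key, group_value, target_count, seed=0):
    """Return records plus naive synthetic rows for one under-represented group.

    Assumption: each non-group column is sampled independently from the
    group's observed values. This ignores correlations between columns and
    gives no privacy guarantee; it only illustrates the idea of amplification.
    """
    rng = random.Random(seed)
    group_rows = [r for r in records if r[group_key] == group_value]
    if not group_rows:
        raise ValueError("no rows found for the requested group")
    columns = [c for c in group_rows[0] if c != group_key]

    synthetic = []
    # Generate only as many rows as needed to reach the target count.
    for _ in range(max(0, target_count - len(group_rows))):
        row = {group_key: group_value}
        for col in columns:
            # Pick a value for this column from a randomly chosen group row.
            row[col] = rng.choice(group_rows)[col]
        synthetic.append(row)
    return records + synthetic


# Toy, made-up dataset in which group "B" is under-represented.
records = (
    [{"group": "A", "age_band": "40-49", "outcome": 1} for _ in range(6)]
    + [{"group": "A", "age_band": "50-59", "outcome": 0} for _ in range(2)]
    + [
        {"group": "B", "age_band": "30-39", "outcome": 1},
        {"group": "B", "age_band": "60-69", "outcome": 0},
    ]
)

amplified = amplify_group(records, "group", "B", target_count=8)
counts = Counter(r["group"] for r in amplified)
print(counts)
```

After the call, both groups have the same number of rows, so a model trained on `amplified` no longer sees group "B" as a small minority. Whether that actually improves fairness depends on how faithful the synthetic rows are, which is why a proper generative model would replace the independent sampling used here.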
We encourage you to read the full article here. It goes into much more detail on the different use cases, depending on requirements, and it provides some examples of organizations that are embracing SDG – for different purposes and to varying degrees.
It’s also worth taking note of the authors’ views on what may be needed to support the further adoption and future direction of data synthesis in the pharmaceutical industry. This includes increased awareness of the data quality and privacy assurances of synthetic data as non-identifiable data; industry guidelines for its use; having the right skills, and technical and organizational safeguards in place; and a growing acceptance by industry, regulators and the scientific and medical communities of the technology and the results it produces. Some of what the authors point to is echoed in recommendations we have made about regulating non-identifiable data.
The good news is that we are seeing progress on all these fronts, as synthetic data continues to become a high-utility, privacy-friendly, innovative and realistic solution for the 7 use cases above.