Synthetic data generation (SDG) can help solve data access and availability issues in the life sciences and other industries. Our synthetic data technology addresses privacy challenges by allowing data to be shared more freely, quickly and cost-effectively, and by enabling amplification and augmentation of existing data.
Healthcare and other industries benefit from SDG for machine learning, software testing, education and training, among other applications. The Replica Synthesis toolbox covers many approaches to match the complexity of the data, including cohort definition, longitudinal data synthesis, full and partial synthesis, and the development and sharing of data simulators. These industries can make innovative data-driven decisions using SDG technology without compromising privacy.
Machine learning models must be trained on large volumes of data, which can be difficult to obtain. Model performance can improve when the training data is amplified with synthetic data. Models built using synthetic data are more easily shared and are also less vulnerable to privacy attacks, as the data used to train them is non-identifiable. Because synthetic datasets are realistic, they can be used for model evaluation instead of real data, reducing privacy risks. Models can also become fairer and more inclusive when their training datasets are amplified with synthetic data, for example when a particular group is under-represented.
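As a rough illustration of amplifying an under-represented group, the sketch below tops up a small group with synthetic rows. It is a deliberately simple stand-in and not how the Replica Synthesis toolbox works: each numeric column is sampled from an independent normal distribution fitted to the real group, so correlations between columns are ignored. The function name `amplify_group`, the column names and the example records are all hypothetical.

```python
import random
import statistics
from collections import Counter


def amplify_group(rows, group_key, group_value, numeric_keys, target_count, seed=0):
    """Return the original rows plus synthetic rows for one group.

    Assumption: each numeric column is modelled as an independent normal
    distribution fitted to the real group members. A production generator
    would also model the relationships between columns.
    """
    rng = random.Random(seed)
    members = [r for r in rows if r[group_key] == group_value]
    if len(members) < 2:
        raise ValueError("need at least two real records to estimate spread")

    # Fit a mean and sample standard deviation per numeric column.
    fitted = {}
    for key in numeric_keys:
        values = [r[key] for r in members]
        fitted[key] = (statistics.mean(values), statistics.stdev(values))

    # Generate only as many rows as are needed to reach the target size.
    synthetic = []
    for _ in range(max(0, target_count - len(members))):
        row = {group_key: group_value, "synthetic": True}
        for key, (mu, sigma) in fitted.items():
            row[key] = rng.gauss(mu, sigma)
        synthetic.append(row)
    return rows + synthetic


# Hypothetical records: group "B" is under-represented (2 rows versus 6).
real_rows = (
    [{"group": "A", "age": 40 + i, "bmi": 24.0 + i} for i in range(6)]
    + [{"group": "B", "age": 55, "bmi": 27.5}, {"group": "B", "age": 61, "bmi": 29.0}]
)
balanced = amplify_group(real_rows, "group", "B", ["age", "bmi"], target_count=6)
counts = Counter(r["group"] for r in balanced)
print(counts)
```

Synthetic rows carry a `synthetic` flag so they can be excluded later, for example when evaluating a model on real records only.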
In highly regulated environments, every change to a software application needs to go through a formal validation process. Synthetic data is beneficial for software testing and data engineering because it reduces the risk of privacy violations, is subject to fewer regulatory constraints, and provides realistic datasets with higher utility than currently available anonymized data. In addition, the risk of exposing sensitive or identifying information is significantly reduced. The Norwegian data protection authority, when investigating a major data breach, stated that the breach could have been avoided if synthetic data had been used.
There is also an opportunity to create a library of synthetic datasets that is available to software teams on demand, and to evaluate potential external technology partners with minimal friction.
Synthetic data can be used for training when students or employees learn how to manage patients’ personal data. Current anonymized datasets are too distorted for effective teaching, and many are inaccessible. SDG produces readily available, realistic, reusable and high-granularity datasets that improve the learning experience and the development of required skills. SDG can also be used in hackathons that bring various people together to address real-world problems.
Share, reuse, and retain datasets with synthetic data generation while still meeting global privacy requirements. Because of the rules and costs associated with retaining data, data records must be disposed of when they are no longer needed for their original purposes. By applying synthetic data techniques to real datasets before disposal, organizations can maintain the utility of the data.
Organizations wanting to evaluate new data-centric technologies developed by third parties often face significant limitations, as evaluations must be performed with realistic datasets. However, providing access to real data can take months of extensive contracting and auditing, and these delays limit how quickly technology evaluations can be performed. With SDG technology, readily available synthetic datasets can be used instead, making vendor evaluations scalable, realistic and efficient.
Much can be learned from the re-analysis of existing data. However, access to internal datasets for exploratory and detailed analytics faces significant friction due to GDPR requirements. Existing anonymization techniques distort datasets too much and take too long to produce results. With SDG, high-quality datasets can be reused more easily across the organization. Scientists can use synthetic data at the exploration stage to determine exactly which data they require. Synthetic data generated from clinical trial data can serve as a proxy for the real data because the two are similar. Smaller existing datasets can be augmented or amplified to create virtual patients.
SDG technology can help address various external data sharing and access challenges, for example in research and development and in open science. With SDG technology, organizations can, for example, share Electronic Health Record (EHR) data with internal and external partners such as academics and life sciences companies. Synthetic data eases internal secondary use and external sharing by increasing transparency and information sharing, and by accelerating access to complex datasets to solve real-world global issues.